This is the introduction to my final assignment for the Open Data Science course. I will be working on the alcohol consumption data set.

Loading the data set and packages

#clear environment
rm(list=ls())

#read data
fpath<- "C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-final/data/"
alco_data <- read.table(paste0(fpath, "alcohol_student.csv"), sep=",", header=T)
alco_data<- alco_data[,-1]

#necessary packages
#install.packages("dismo")
library(dplyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(tidyr)
library(tidyverse)
library(gbm)
library(dismo)
library(caTools)
library(mgcv)
library(MASS)
library(FactoMineR)

AIMS AND OBJECTIVES

My aim is to demonstrate the use of various statistical techniques for gaining insight into the data set, in order to understand the patterns of alcohol consumption and absences among students in two schools. Further, I seek to explore the distribution of grades among male and female students, and the factors that might be affecting it. Firstly, I will use basic descriptive statistics to understand the distributions, correlations and dependencies (cor, barplot, histogram, biplot, etc.). I will also be using regression models (Generalised Linear Model (GLM), Generalised Additive Model (GAM) and Generalised Boosted Model (GBM)). Furthermore, I will test my models using a 70:30 train/test data split.
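As a minimal sketch of how the 70:30 split could be done (here with base R's `sample()`; the `caTools` package loaded below offers `sample.split()` as an alternative), with an arbitrary seed for reproducibility:

```r
# Illustrative 70:30 train/test split with base R; the seed is arbitrary
set.seed(123)
n <- nrow(alco_data)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- alco_data[train_idx, ]
test_set  <- alco_data[-train_idx, ]
```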

I will also explore Linear Discriminant Analysis (LDA) and Multiple Correspondence Analysis (MCA) to classify and categorise the data.
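As a hedged sketch of the LDA step, a model could be fitted with `MASS::lda()` using `high_use` as the class to predict; the predictors chosen here are only illustrative, not the final model:

```r
library(MASS)  # loaded again below with the other packages

# Illustrative LDA: classify high_use from a few example predictors
lda_fit  <- lda(high_use ~ goout + absences + studytime + sex, data = alco_data)
lda_pred <- predict(lda_fit)

# Cross-tabulate predicted vs observed classes
table(predicted = lda_pred$class, observed = alco_data$high_use)
```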

RESEARCH QUESTIONS

DATA EXPLORATION

categ = apply(alco_data, 2, function(x) nlevels(as.factor(x)))
categ
##     school        sex        age    address    famsize    Pstatus 
##          2          2          7          2          2          2 
##       Medu       Fedu       Mjob       Fjob     reason    nursery 
##          5          5          5          5          4          2 
##   internet   guardian traveltime  studytime   failures  schoolsup 
##          2          3          4          4          4          2 
##     famsup       paid activities     higher   romantic     famrel 
##          2          2          2          2          2          5 
##   freetime      goout       Dalc       Walc     health   absences 
##          5          5          5          5          5         26 
##         G1         G2         G3    alc_use   high_use 
##         14         15         18          9          2
dim(alco_data)
## [1] 382  35
str(alco_data)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
summary(alco_data)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 
#glimpse(alco_data)

#bar plot of the variables
gather(alco_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() 

To avoid confusion, I will copy the data into a new data frame for the MCA. Firstly, I will extract the categorical variables as factors.

#To avoid confusion, I will copy the data into a new data frame for MCA
data_mca <- alco_data

#Firstly, I will be extracting the categorical variables, as factors
#now, filter them out for the MCA
data_mca_fac1<-Filter(is.factor, data_mca)

high_use<-factor(alco_data[,"high_use"])
data_mca_fac1 <- cbind.data.frame(data_mca_fac1, high_use)

MCA plot

#par(mfrow=c(1,2))
#now, perform the MCA on the categorical/qualitative variables
mca_alco1 <- MCA(data_mca_fac1, graph = T)

#plot it
plot(mca_alco1, invisible=c("ind"), habillage="quali")

#par(mfrow=c(1,1))

BETTER VISUALISATION FOR MCA

# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ = apply(data_mca_fac1, 2, function(x) nlevels(as.factor(x)))
categ
##     school        sex    address    famsize    Pstatus       Mjob 
##          2          2          2          2          2          5 
##       Fjob     reason    nursery   internet   guardian  schoolsup 
##          5          4          2          2          3          2 
##     famsup       paid activities     higher   romantic   high_use 
##          2          2          2          2          2          2
# data frame with variable coordinates
mca_vars_df = data.frame(mca_alco1$var$coord, Variable = rep(names(categ), categ))
#mca_vars_df


# data frame with observation coordinates
mca_obs_df = data.frame(mca_alco1$ind$coord)


# MCA plot of observations and categories
ggplot(data = mca_obs_df, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.0) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca_vars_df, 
            aes(x = Dim.1, y = Dim.2, 
                label = rownames(mca_vars_df), colour = Variable)) +
  ggtitle("MCA plot of variables using R package FactoMineR") +
  scale_colour_discrete(name = "Variable")

Here, we can see that female students generally attend the school because of its reputation and receive more school support. They also have family support and are able to attend paid extra classes. Most of their fathers work in the health sector. High alcohol consumption is more rampant amongst male students than amongst their female counterparts. Male students also attend the school based on course preference and other reasons. By and large, they do not attend paid extra classes, and they do not receive family and school support like the female students do. Lastly, they mostly have no intention of pursuing higher education. This is explored further by categorising more continuous variables, such as grades and absences, to allow for comparison with the other variables.

CATEGORISE MORE VARIABLES

Next, I will be categorising grade (G3), absences, mother's education (Medu) and father's education (Fedu).

#Next, I will be categorising grade(G3), absences, Mother's education(Medu),
#father's education
data_mca2<- alco_data

#now, convert mother and father's education level into something more understandable.
#data_mca$Medu<- factor(data_mca$Medu)
data_mca2$Medu<- factor(data_mca2$Medu, labels=c("no", "pr","5-9th", "sec", "hiE"))
data_mca2$Fedu<- factor(data_mca2$Fedu, labels=c("no", "pr","5-9th", "sec", "hiE"))

#Now, let's categorise absences according to quartiles
bins_abs<- quantile(data_mca2$absences)
data_mca2$absences<-cut(data_mca2$absences, breaks=bins_abs, include.lowest=T,
                       labels=c("vlow", "low", "high", "vhigh"))

#same for grade (G3)
bins_gra<- quantile(data_mca2$G3)
data_mca2$G3<-cut(data_mca2$G3, breaks=bins_gra, include.lowest=T,
                 labels=c("vlow", "low", "high", "vhigh"))

#Getting the columns that are factors
#since MCA is for categorical variables, I will be filtering the categorical
#variables/factors and also categorise some of the variables of interest
#such as absences, grade(G3). I will also Categorise other variables of interest that are in integer forms.

#let's first see the variables already in factor format
names(Filter(is.factor, data_mca2))
##  [1] "school"     "sex"        "address"    "famsize"    "Pstatus"   
##  [6] "Medu"       "Fedu"       "Mjob"       "Fjob"       "reason"    
## [11] "nursery"    "internet"   "guardian"   "schoolsup"  "famsup"    
## [16] "paid"       "activities" "higher"     "romantic"   "absences"  
## [21] "G3"
#now, filter them out for the MCA
data_mca_fac2_<-Filter(is.factor, data_mca2)

#include the high alcohol use column
high_use<-factor(alco_data[,"high_use"])

#join with the dataframe with categorical variables
data_mca_fac2_<-cbind.data.frame(data_mca_fac2_, high_use)
str(data_mca_fac2_)
## 'data.frame':    382 obs. of  22 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
##  $ Fedu      : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ absences  : Factor w/ 4 levels "vlow","low","high",..: 2 2 4 2 3 2 1 1 4 4 ...
##  $ G3        : Factor w/ 4 levels "vlow","low","high",..: 2 1 2 1 2 2 1 1 4 1 ...
##  $ high_use  : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 2 1 1 1 2 1 ...
#The above can also be done with dplyr which has been loaded already too
#data_mca %>% Filter(f = is.factor)

#names of the variables again ?
names(data_mca_fac2_) 
##  [1] "school"     "sex"        "address"    "famsize"    "Pstatus"   
##  [6] "Medu"       "Fedu"       "Mjob"       "Fjob"       "reason"    
## [11] "nursery"    "internet"   "guardian"   "schoolsup"  "famsup"    
## [16] "paid"       "activities" "higher"     "romantic"   "absences"  
## [21] "G3"         "high_use"
#Alternative for finding the column names that are factors
# names(data_reg)[ sapply(data_reg, is.factor) ]
# #or
# data_reg %>% Filter(f = is.factor) %>% names
# or
# data_reg %>% summarise_each(funs(is.factor)) %>% unlist %>% .[.] %>% names


#I will select a few of the variables for clarity's sake and include sex too
data_mca_fac2<-data_mca_fac2_[,c("sex",names(data_mca_fac2_[,(10:ncol(data_mca_fac2_))]))]
    
#now, perform the MCA on the categorical/qualitative variables
mca_alco2 <- MCA(data_mca_fac2, graph = T)

#plot it
plot(mca_alco2, invisible=c("ind"), habillage="quali")

# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ2 = apply(data_mca_fac2, 2, function(x) nlevels(as.factor(x)))
categ2
##        sex     reason    nursery   internet   guardian  schoolsup 
##          2          4          2          2          3          2 
##     famsup       paid activities     higher   romantic   absences 
##          2          2          2          2          2          4 
##         G3   high_use 
##          4          2
# data frame with variable coordinates
mca_vars_df2 = data.frame(mca_alco2$var$coord, Variable = rep(names(categ2), categ2))
#mca_vars_df2


# data frame with observation coordinates
mca_obs_df2 = data.frame(mca_alco2$ind$coord)


# MCA plot of observations and categories
ggplot(data = mca_obs_df2, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.1) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca_vars_df2, 
            aes(x = Dim.1, y = Dim.2, 
                label = rownames(mca_vars_df2), colour = Variable)) +
  ggtitle("MCA plot of variables") +
  scale_colour_discrete(name = "Variable")

Here, we can see the categorisations and associations. The NW quadrant shows that those who pay for extra classes in Portuguese and Math have family support and are mostly female students. They also receive extra educational support and have ambitions for higher education.

In the NE quadrant, high alcohol use seems to be associated with having a guardian other than the mother or father. Those with high alcohol use also seem to have no ambition for higher education. They perform poorly in their studies (low grades) and mostly choose the school because it is close to home, amongst other reasons, rather than for its reputation or course preference. They also do not seem to participate in extracurricular activities, are mostly in romantic relationships, and have high absences.

In the SW quadrant, it can be seen that those who have low absences perform very well in their studies and do not consume excessive alcohol. They are also engaged in extracurricular activities and have their mother as their guardian. They are also motivated to attend the school because of its reputation.

In the SE quadrant, male students seem to be moderately absent and do not attend paid extra classes in Portuguese and Math. They mostly attend the school because of course preference. They generally have no internet access and mostly do not receive the school's support. It also looks like their father is their guardian.

This will be further explored with a family of regression models (GLM, GAM, GBM).

REGRESSION MODELS (GLM, GAM, GBM)

Before I proceed, there is a need to determine the error distribution of the response variables. As in many realistic cases, we do not always have a normal (i.e. Gaussian) distribution; we could also have a Poisson distribution for count data, or a binomial distribution (e.g. Yes/No, Presence/Absence). That said, one requirement of the Poisson distribution is that the variance should not be smaller than the mean (in an ideal Poisson the two are equal). Therefore, I will create a function to test this assumption and decide on the kind of distribution. You can see more here

Create function to test variance>mean assumption of poisson distribution

#Create function to test variance>mean assumption of poisson distribution
test.var.mn<- function(x){
  if(var(x)>mean(x)){
    print("VALID:The variance>mean assumption of poisson distribution is valid")
  }
  else{
    print("INVALID:The variance>mean assumption of poisson distribution is invalid")
  }
}

TEST THE ASSUMPTION OF POISSON DISTRIBUTION ON GRADE (G3) AND ABSENCES

#see if Grade(G3) meets the assumption
test.var.mn(alco_data$G3)
## [1] "INVALID:The variance>mean assumption of poisson distribution is invalid"
#next, see if "absences" meets the assumption
test.var.mn(alco_data$absences)
## [1] "VALID:The variance>mean assumption of poisson distribution is valid"

It shows that the assumption is invalid for grade (G3) but valid for the “absences” variable. Therefore, I will next explore various regression models, treating grade (G3) as Gaussian, absences as Poisson and high alcohol use (high_use) as binomial.
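Concretely, the three families could be specified along these lines (the predictor sets here are illustrative placeholders, not the selected models):

```r
# Grade (G3): roughly continuous response -> gaussian family
g3_mod  <- glm(G3 ~ studytime + goout, data = alco_data, family = "gaussian")

# Absences: count data whose variance exceeds its mean -> poisson family
abs_mod <- glm(absences ~ goout + alc_use, data = alco_data, family = "poisson")

# High alcohol use: TRUE/FALSE response -> binomial family
use_mod <- glm(high_use ~ goout + sex, data = alco_data, family = "binomial")
```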

As done earlier, the data will be copied into a new data frame for all the regression analyses, to avoid confusion and alteration of the original data.

For the GLM, predictors will be selected by first using stepwise regression to eliminate the redundant variables. Afterwards, I will employ the significance tests from the model summary, the anova test (Chi-square) and the AIC values. I also checked whether any of the predictors is curvilinear (i.e. needs a higher-order polynomial term).

#As done earlier, the data will be copied into a new dataframe for 
#all the regression analysis, to avoid confusion and alteration of 
#the original data
data_reg <- alco_data

#for GLM, predictors were selected by employing the significance tests from the model
#summary, anova test (Chi-square), AIC values and stepwise regression. I also checked
#if any of the predictors is curvilinear (i.e. needs a higher-order polynomial term)


grade_glm<-glm(G3~ sex + age+address+Medu+Fedu+
                 Pstatus+ traveltime+studytime+famsup+activities+higher+
                 internet+famrel+romantic+freetime+goout+ alc_use+ absences
               ,data=data_reg,family ="gaussian")

#First, I used the stepwise regression to eliminate the redundant variables.
stepAIC(grade_glm, direction = "both")
## Start:  AIC=1950.36
## G3 ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + famsup + activities + higher + internet + famrel + 
##     romantic + freetime + goout + alc_use + absences
## 
##              Df Deviance    AIC
## - freetime    1   3322.6 1948.4
## - activities  1   3322.9 1948.4
## - Fedu        1   3325.3 1948.7
## - famrel      1   3326.5 1948.8
## - alc_use     1   3326.9 1948.9
## - age         1   3328.8 1949.1
## - absences    1   3330.2 1949.2
## - sex         1   3330.9 1949.3
## - Pstatus     1   3332.2 1949.5
## - traveltime  1   3336.1 1949.9
## - famsup      1   3336.2 1949.9
## <none>            3322.6 1950.4
## - internet    1   3340.6 1950.4
## - address     1   3342.1 1950.6
## - romantic    1   3344.4 1950.9
## - goout       1   3361.7 1952.8
## - Medu        1   3365.9 1953.3
## - studytime   1   3380.7 1955.0
## - higher      1   3494.9 1967.7
## 
## Step:  AIC=1948.37
## G3 ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + famsup + activities + higher + internet + famrel + 
##     romantic + goout + alc_use + absences
## 
##              Df Deviance    AIC
## - activities  1   3322.9 1946.4
## - Fedu        1   3325.3 1946.7
## - famrel      1   3326.7 1946.8
## - alc_use     1   3326.9 1946.9
## - age         1   3328.9 1947.1
## - absences    1   3330.4 1947.3
## - sex         1   3331.3 1947.4
## - Pstatus     1   3332.2 1947.5
## - famsup      1   3336.2 1947.9
## - traveltime  1   3336.3 1947.9
## <none>            3322.6 1948.4
## - internet    1   3340.8 1948.5
## - address     1   3342.1 1948.6
## - romantic    1   3344.4 1948.9
## + freetime    1   3322.6 1950.4
## - goout       1   3363.4 1951.0
## - Medu        1   3366.0 1951.3
## - studytime   1   3381.0 1953.0
## - higher      1   3495.0 1965.7
## 
## Step:  AIC=1946.4
## G3 ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + famsup + higher + internet + famrel + romantic + 
##     goout + alc_use + absences
## 
##              Df Deviance    AIC
## - Fedu        1   3325.7 1944.7
## - famrel      1   3327.1 1944.9
## - alc_use     1   3327.5 1944.9
## - age         1   3329.4 1945.1
## - absences    1   3330.6 1945.3
## - Pstatus     1   3332.3 1945.5
## - sex         1   3332.3 1945.5
## - traveltime  1   3336.5 1946.0
## - famsup      1   3336.9 1946.0
## <none>            3322.9 1946.4
## - internet    1   3341.3 1946.5
## - address     1   3342.2 1946.6
## - romantic    1   3344.5 1946.9
## + activities  1   3322.6 1948.4
## + freetime    1   3322.9 1948.4
## - goout       1   3363.4 1949.0
## - Medu        1   3366.6 1949.4
## - studytime   1   3382.8 1951.2
## - higher      1   3498.6 1964.1
## 
## Step:  AIC=1944.72
## G3 ~ sex + age + address + Medu + Pstatus + traveltime + studytime + 
##     famsup + higher + internet + famrel + romantic + goout + 
##     alc_use + absences
## 
##              Df Deviance    AIC
## - famrel      1   3330.0 1943.2
## - alc_use     1   3330.1 1943.2
## - age         1   3332.5 1943.5
## - absences    1   3333.7 1943.6
## - sex         1   3334.9 1943.8
## - Pstatus     1   3335.1 1943.8
## - famsup      1   3338.6 1944.2
## - traveltime  1   3340.6 1944.4
## <none>            3325.7 1944.7
## - internet    1   3344.2 1944.8
## - address     1   3344.2 1944.8
## - romantic    1   3347.0 1945.2
## + Fedu        1   3322.9 1946.4
## + activities  1   3325.3 1946.7
## + freetime    1   3325.7 1946.7
## - goout       1   3366.3 1947.3
## - studytime   1   3383.9 1949.3
## - Medu        1   3417.6 1953.1
## - higher      1   3506.2 1962.9
## 
## Step:  AIC=1943.22
## G3 ~ sex + age + address + Medu + Pstatus + traveltime + studytime + 
##     famsup + higher + internet + romantic + goout + alc_use + 
##     absences
## 
##              Df Deviance    AIC
## - age         1   3336.1 1941.9
## - alc_use     1   3336.2 1941.9
## - absences    1   3338.3 1942.2
## - Pstatus     1   3339.0 1942.2
## - sex         1   3340.6 1942.4
## - famsup      1   3342.9 1942.7
## - traveltime  1   3344.9 1942.9
## <none>            3330.0 1943.2
## - address     1   3348.3 1943.3
## - internet    1   3349.9 1943.5
## - romantic    1   3352.5 1943.8
## + famrel      1   3325.7 1944.7
## + Fedu        1   3327.1 1944.9
## + activities  1   3329.5 1945.2
## + freetime    1   3329.8 1945.2
## - goout       1   3368.2 1945.6
## - studytime   1   3388.6 1947.9
## - Medu        1   3421.8 1951.6
## - higher      1   3512.7 1961.6
## 
## Step:  AIC=1941.92
## G3 ~ sex + address + Medu + Pstatus + traveltime + studytime + 
##     famsup + higher + internet + romantic + goout + alc_use + 
##     absences
## 
##              Df Deviance    AIC
## - alc_use     1   3343.3 1940.7
## - absences    1   3345.8 1941.0
## - Pstatus     1   3345.9 1941.0
## - famsup      1   3347.2 1941.2
## - sex         1   3347.8 1941.2
## - traveltime  1   3351.6 1941.7
## <none>            3336.1 1941.9
## - address     1   3357.0 1942.3
## - internet    1   3357.8 1942.4
## - romantic    1   3361.2 1942.8
## + age         1   3330.0 1943.2
## + famrel      1   3332.5 1943.5
## + Fedu        1   3332.9 1943.5
## + activities  1   3335.4 1943.8
## + freetime    1   3335.9 1943.9
## - goout       1   3379.0 1944.8
## - studytime   1   3392.7 1946.3
## - Medu        1   3430.7 1950.6
## - higher      1   3538.4 1962.4
## 
## Step:  AIC=1940.74
## G3 ~ sex + address + Medu + Pstatus + traveltime + studytime + 
##     famsup + higher + internet + romantic + goout + absences
## 
##              Df Deviance    AIC
## - sex         1   3351.5 1939.7
## - Pstatus     1   3353.5 1939.9
## - famsup      1   3354.4 1940.0
## - absences    1   3356.7 1940.3
## <none>            3343.3 1940.7
## - traveltime  1   3360.9 1940.8
## - internet    1   3364.2 1941.1
## - address     1   3367.1 1941.5
## - romantic    1   3369.4 1941.7
## + alc_use     1   3336.1 1941.9
## + age         1   3336.2 1941.9
## + famrel      1   3337.8 1942.1
## + Fedu        1   3340.3 1942.4
## + activities  1   3342.0 1942.6
## + freetime    1   3343.1 1942.7
## - studytime   1   3407.7 1946.0
## - goout       1   3409.5 1946.2
## - Medu        1   3438.8 1949.5
## - higher      1   3544.7 1961.1
## 
## Step:  AIC=1939.67
## G3 ~ address + Medu + Pstatus + traveltime + studytime + famsup + 
##     higher + internet + romantic + goout + absences
## 
##              Df Deviance    AIC
## - Pstatus     1   3361.3 1938.8
## - famsup      1   3365.7 1939.3
## - absences    1   3367.6 1939.5
## - traveltime  1   3367.9 1939.5
## <none>            3351.5 1939.7
## - address     1   3374.2 1940.2
## - internet    1   3374.9 1940.3
## + sex         1   3343.3 1940.7
## + age         1   3343.7 1940.8
## - romantic    1   3380.3 1941.0
## + famrel      1   3345.2 1941.0
## + alc_use     1   3347.8 1941.2
## + Fedu        1   3348.6 1941.3
## + activities  1   3349.3 1941.4
## + freetime    1   3350.6 1941.6
## - studytime   1   3408.3 1944.1
## - goout       1   3416.2 1945.0
## - Medu        1   3460.2 1949.9
## - higher      1   3545.5 1959.2
## 
## Step:  AIC=1938.78
## G3 ~ address + Medu + traveltime + studytime + famsup + higher + 
##     internet + romantic + goout + absences
## 
##              Df Deviance    AIC
## - absences    1   3374.4 1938.3
## - famsup      1   3376.2 1938.5
## - traveltime  1   3377.5 1938.6
## <none>            3361.3 1938.8
## - internet    1   3382.6 1939.2
## - address     1   3386.2 1939.6
## + Pstatus     1   3351.5 1939.7
## + age         1   3352.6 1939.8
## - romantic    1   3388.5 1939.9
## + sex         1   3353.5 1939.9
## + famrel      1   3355.4 1940.1
## + alc_use     1   3357.1 1940.3
## + Fedu        1   3358.3 1940.5
## + activities  1   3359.9 1940.6
## + freetime    1   3360.6 1940.7
## - studytime   1   3416.9 1943.0
## - goout       1   3428.1 1944.3
## - Medu        1   3479.9 1950.0
## - higher      1   3558.4 1958.6
## 
## Step:  AIC=1938.27
## G3 ~ address + Medu + traveltime + studytime + famsup + higher + 
##     internet + romantic + goout
## 
##              Df Deviance    AIC
## - famsup      1   3390.0 1938.0
## - traveltime  1   3390.0 1938.0
## <none>            3374.4 1938.3
## - internet    1   3393.4 1938.4
## + absences    1   3361.3 1938.8
## + age         1   3363.4 1939.0
## + sex         1   3364.2 1939.1
## - address     1   3402.2 1939.4
## + famrel      1   3367.5 1939.5
## + Pstatus     1   3367.6 1939.5
## + alc_use     1   3367.8 1939.5
## - romantic    1   3405.0 1939.7
## + Fedu        1   3371.1 1939.9
## + activities  1   3373.0 1940.1
## + freetime    1   3373.2 1940.1
## - studytime   1   3437.3 1943.3
## - goout       1   3447.8 1944.5
## - Medu        1   3487.6 1948.9
## - higher      1   3578.7 1958.7
## 
## Step:  AIC=1938.04
## G3 ~ address + Medu + traveltime + studytime + higher + internet + 
##     romantic + goout
## 
##              Df Deviance    AIC
## - traveltime  1   3406.4 1937.9
## - internet    1   3407.3 1938.0
## <none>            3390.0 1938.0
## + famsup      1   3374.4 1938.3
## + sex         1   3376.1 1938.5
## + absences    1   3376.2 1938.5
## + age         1   3381.6 1939.1
## - address     1   3418.3 1939.2
## + Pstatus     1   3382.7 1939.2
## + famrel      1   3382.7 1939.2
## + alc_use     1   3384.0 1939.4
## - romantic    1   3421.0 1939.5
## + activities  1   3387.9 1939.8
## + Fedu        1   3388.0 1939.8
## + freetime    1   3388.9 1939.9
## - studytime   1   3445.9 1942.3
## - goout       1   3462.1 1944.1
## - Medu        1   3494.4 1947.6
## - higher      1   3589.9 1957.9
## 
## Step:  AIC=1937.88
## G3 ~ address + Medu + studytime + higher + internet + romantic + 
##     goout
## 
##              Df Deviance    AIC
## - internet    1   3424.1 1937.9
## <none>            3406.4 1937.9
## + traveltime  1   3390.0 1938.0
## + famsup      1   3390.0 1938.0
## + absences    1   3393.3 1938.4
## + sex         1   3394.2 1938.5
## + age         1   3397.3 1938.9
## + alc_use     1   3398.4 1939.0
## + famrel      1   3399.1 1939.1
## + Pstatus     1   3399.1 1939.1
## - romantic    1   3437.4 1939.3
## + Fedu        1   3403.3 1939.5
## + activities  1   3404.4 1939.7
## + freetime    1   3405.0 1939.7
## - address     1   3455.4 1941.3
## - studytime   1   3469.6 1942.9
## - goout       1   3484.4 1944.5
## - Medu        1   3528.0 1949.3
## - higher      1   3605.9 1957.6
## 
## Step:  AIC=1937.86
## G3 ~ address + Medu + studytime + higher + romantic + goout
## 
##              Df Deviance    AIC
## <none>            3424.1 1937.9
## + internet    1   3406.4 1937.9
## + traveltime  1   3407.3 1938.0
## + famsup      1   3409.4 1938.2
## + sex         1   3409.7 1938.2
## + age         1   3413.2 1938.6
## + absences    1   3413.2 1938.7
## + famrel      1   3415.2 1938.9
## - romantic    1   3452.0 1939.0
## + alc_use     1   3417.5 1939.1
## + Pstatus     1   3418.2 1939.2
## + Fedu        1   3420.7 1939.5
## + activities  1   3421.3 1939.5
## + freetime    1   3421.8 1939.6
## - address     1   3486.8 1942.8
## - studytime   1   3490.8 1943.2
## - goout       1   3497.2 1943.9
## - Medu        1   3565.4 1951.3
## - higher      1   3618.2 1956.9
## 
## Call:  glm(formula = G3 ~ address + Medu + studytime + higher + romantic + 
##     goout, family = "gaussian", data = data_reg)
## 
## Coefficients:
## (Intercept)     addressU         Medu    studytime    higheryes  
##      6.0992       1.0045       0.5743       0.5093       3.5013  
## romanticyes        goout  
##     -0.5891      -0.3955  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  375 Residual
## Null Deviance:       4175 
## Residual Deviance: 3424  AIC: 1938
#The stepwise selection returned: glm(G3 ~ address + Medu + studytime + higher + romantic +
#  goout, data = data_reg, family = "gaussian")
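The stepwise trace above can be reproduced with `stepAIC` from the MASS package. As a minimal, self-contained sketch (using the built-in `mtcars` data, so the variable names here are illustrative and not from the alcohol data set):

```r
library(MASS)

# fit a full Gaussian GLM, then let stepAIC add/drop terms in both directions
full_glm <- glm(mpg ~ wt + hp + qsec + drat, data = mtcars, family = "gaussian")
step_glm <- stepAIC(full_glm, direction = "both", trace = FALSE)

# the retained formula and its AIC
formula(step_glm)
AIC(step_glm)
```

With `trace = TRUE` (the default), `stepAIC` prints the same kind of step-by-step deviance/AIC table shown above.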


#Thereafter, the ANOVA chi-square test and the t-test from the model summary
#were used to further eliminate the insignificant variables at the 0.05 level.
#After this, "romantic" fell just outside the level and was hence removed
grade_glm<-glm(G3~ address + Medu  + studytime + higher + 
                 goout,data=data_reg,family ="gaussian")

#Anova test
anova(grade_glm, test="Chisq")
## Analysis of Deviance Table
## 
## Model: gaussian, link: identity
## 
## Response: G3
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        381     4174.8              
## address    1   88.385       380     4086.4 0.0019172 ** 
## Medu       1  193.900       379     3892.5 4.314e-06 ***
## studytime  1  122.115       378     3770.4 0.0002652 ***
## higher     1  245.311       377     3525.1 2.352e-07 ***
## goout      1   73.150       376     3452.0 0.0047619 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA chi-square test and the t-test from the model summary were used to further eliminate the insignificant variables at the 0.05 level.

Here, we can see that the significant factors affecting students’ grades include address, mother’s level of education, study time, ambition to pursue higher education, and going out. Mother’s education and higher-education ambition appear to be the most significant factors. However, the effect of mother’s education level is not curvilinear (i.e. it does not need a higher-order polynomial term such as I(Medu^2)).

In accordance with the principle of parsimony, I decided to keep only the variables significant at the 0.05 level. Therefore, the final model is: grade_glm <- glm(G3 ~ address + Medu + studytime + higher + goout, data = data_reg, family = "gaussian")

Let’s also check whether any observation has high leverage, further test the normality assumption, and explore how the residuals differ from the fitted values.

par(mfrow=c(2,2))
plot(grade_glm, which = c(1,2,5))

From this, the constant-variance assumption for the errors seems to hold. The distribution also appears normal in the Normal Q-Q plot, although there is a slight divergence from the line at the lower end. Also, no observation seems to have clear leverage over the others, aside from one point lying quite far from the rest. Hence, the Gaussian distribution can safely be used to model the grade (G3) response variable.

Mean Error and RMSE

Create functions to calculate the mean absolute error and the root mean square error (RMSE).

# functions to calculate the mean absolute error and RMSE
#function to calculate the mean absolute error
mean_error<- function(obs, pred){
  me<-mean(abs(obs-pred))
  return(me)
}

# Function that returns Root Mean Squared Error
rmse <- function(obs, pred){
  rmse<-sqrt(mean((obs-pred)^2))
  return(rmse)
}
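A quick sanity check of these two helper functions on toy vectors (the numbers here are made up for illustration):

```r
obs  <- c(10, 12, 14)
pred <- c(11, 12, 17)

mean_error(obs, pred)  # mean absolute error: (1 + 0 + 3) / 3 = 1.3333
rmse(obs, pred)        # sqrt((1 + 0 + 9) / 3) = 1.8257
```

RMSE penalises large deviations more heavily than the mean absolute error, which is why it is larger here.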

BOOTSTRAPPING AND REGRESSION MODELLING (GLM, GAM & GBM).

EVALUATION AND TESTING OF MODELS

MODELLING GRADE(G3).

This part covers the model validation. The data was divided into 70% training/calibration data and 30% testing/evaluation data, and this random split was repeated 10 times. The results of the models and the response curves are presented thereafter.

#First, copy the original data into a new data frame, to avoid confusion
data_reg<-alco_data
{rep<-10
  for (i in 1:rep){
    #print the index to see the iteration
    print(i)
    
    #draw a random 70% sample (without replacement) of the row indices
    rand<- sample(1:nrow(data_reg), size = 0.7*nrow(data_reg))
    
    #70% for the train/calibration data
    cal<- data_reg[rand,]
    
    #remaining 30% for the test/evaluation data
    eva<-data_reg[-rand,]
    
    ####GLM
    #fit a Generalised Linear Model (GLM)
    grade_glm <- glm(G3~address + Medu  + studytime + higher + 
                       goout, data=cal, family = "poisson") 
    
    #predict into the test/evaluation data
    grade_glm_pred<- predict.glm(object = grade_glm, newdata = eva, type="response")
    
    #correlation between the predictions and the observed values in the test data
    cor_glm_grade<-cor(grade_glm_pred, eva$G3, method = "spearman")
    
    
    #########
    #mean error and root mean square error
    #calculate the mean error
    error_grade_glm<- cbind.data.frame(grade_glm_pred, eva$G3)
    colnames(error_grade_glm) <- c("pred_glm_grade", "obs_grade")
    
    #Use the functions created earlier to calculate the mean error and RMSE.
    #Mean error
    grade_glm_me <- mean_error(error_grade_glm$obs_grade, error_grade_glm$pred_glm_grade)
    
    #RMSE
    grade_glm_rmse <- rmse(error_grade_glm$obs_grade, error_grade_glm$pred_glm_grade)
    
    #combine the dataframe of the mean error and RMSE
    me_rmse_grade_glm <- rbind.data.frame(grade_glm_me, grade_glm_rmse)
    
    #Change the column name to something more descriptive.
    colnames(me_rmse_grade_glm)<- c("grade_glm")
    
    
    
    
    #GAM
    grade_gam <- gam(G3~ s(Medu, k=3) + higher + address + s(studytime, k=3) +
                       goout, data = cal, family = "gaussian")
    
    grade_gam_pred <- predict.gam(grade_gam, newdata = eva, type = "response")
    
    obs_pred_grade_gam<- cbind.data.frame(grade_gam_pred, eva$G3)
    colnames(obs_pred_grade_gam) <- c("pred_gam_grade", "obs_gam_grade")
    
    #you can just calculate the correlation straight away
    cor_gam_grade <- cor(grade_gam_pred, eva$G3, method = "spearman")
    
    
    #########
    #mean error and root mean square error
    error_grade_gam<- cbind.data.frame(grade_gam_pred, eva$G3)
    colnames(error_grade_gam) <- c("pred_gam_grade", "obs_grade")
    
    grade_gam_me <- mean_error(error_grade_gam$obs_grade, error_grade_gam$pred_gam_grade)
    grade_gam_rmse <- rmse(error_grade_gam$obs_grade, error_grade_gam$pred_gam_grade)
    
    me_rmse_grade_gam <- rbind.data.frame(grade_gam_me, grade_gam_rmse)
    colnames(me_rmse_grade_gam)<- c("grade_gam")
    
    
    
    
    ###################################################################
    #using the standard gbm package
    #GBM
    grade_gbm1<-gbm(formula = G3~ sex + age+address+Medu+Fedu+
                      Pstatus+ traveltime+studytime+famsup+activities+higher+
                      internet+famrel+romantic+freetime+goout+ alc_use+ absences, data=cal,
                   distribution = "gaussian",n.trees = 1000, shrinkage = 0.001, interaction.depth = 6,
                   bag.fraction = 0.75)
    
    best.iter<-gbm.perf(grade_gbm1, plot.it = F, method = "OOB")
    grade_gbm1_pred<- predict.gbm(object = grade_gbm1, newdata = eva, best.iter,
                                 type="response")
    cor_gbm1_grade <- cor(grade_gbm1_pred, eva$G3, method = "spearman")
    
    
    
    #########
    #mean error and root mean square error
    error_grade_gbm1<- cbind.data.frame(grade_gbm1_pred, eva$G3)
    colnames(error_grade_gbm1) <- c("pred_gbm1_grade", "obs_grade")
    
    grade_gbm1_me <- mean_error(error_grade_gbm1$obs_grade, error_grade_gbm1$pred_gbm1_grade)
    grade_gbm1_rmse <- rmse(error_grade_gbm1$obs_grade, error_grade_gbm1$pred_gbm1_grade)
    
    me_rmse_grade_gbm1 <- rbind.data.frame(grade_gbm1_me, grade_gbm1_rmse)
    colnames(me_rmse_grade_gbm1)<- c("grade_gbm1")
    
    
    ###################################################
    #GBM(dismo package)
    
    grade_gbm2 <- gbm.step(data=cal, gbm.x =c("sex", "age","address","Medu"
                 ,"Fedu","Pstatus", "traveltime","studytime","famsup","activities","higher",
                 "internet","famrel","romantic","freetime","goout", "alc_use", "absences"), gbm.y = "G3",
                          bag.fraction=0.75, learning.rate = 0.001,
                          family="gaussian",n.trees=50, n.folds=10,
                          max.trees = 1000, tree.complexity = 6)
    
    #This also works but can be done directly as shown in the prediction after this
    #best.iter2<-gbm.perf(grade_gbm1, plot.it = T, method = "OOB")
    # grade_gbm2_pred<- predict.gbm(object = grade_gbm2, newdata = eva, best.iter2,
    #                              type="response")
    
    #
    #the prediction
    grade_gbm2_pred <- predict.gbm(grade_gbm2, newdata = eva, n.trees=grade_gbm2$n.trees, type = "response")
    #grade_pred_gbm2_ras <- predict(object=ras_stack,model=grade_gbm2, fun=predict,
    #                         n.trees=grade_gbm2$n.trees, type="response")
    
    #plot(grade_pred_gbm2_ras)
    cor_gbm2_grade <- cor(grade_gbm2_pred, eva$G3, method = "spearman")
    
    #########
    #mean error and root mean square error
    error_grade_gbm2<- cbind.data.frame(grade_gbm2_pred, eva$G3)
    colnames(error_grade_gbm2) <- c("pred_gbm2_grade", "obs_grade")
    
    grade_gbm2_me <- mean_error(error_grade_gbm2$obs_grade, error_grade_gbm2$pred_gbm2_grade)
    grade_gbm2_rmse <- rmse(error_grade_gbm2$obs_grade, error_grade_gbm2$pred_gbm2_grade)
    
    me_rmse_grade_gbm2 <- rbind.data.frame(grade_gbm2_me, grade_gbm2_rmse)
    colnames(me_rmse_grade_gbm2)<- c("grade_gbm2")
    
  } 
  #####All correlation
  #(note: each iteration of the loop overwrites these objects, so the
  # values combined below come from the final iteration only)
  all_cor_grade <- cbind.data.frame(cor_glm_grade,cor_gam_grade,
                                   cor_gbm1_grade, cor_gbm2_grade)
  colnames(all_cor_grade)<- c("grade_glm", "grade_gam", "grade_gbm1", "grade_gbm2")
  
  #####all error
  all_error_grade <- cbind.data.frame(me_rmse_grade_glm, me_rmse_grade_gam,
                                     me_rmse_grade_gbm1, me_rmse_grade_gbm2)
  rownames(all_error_grade)<- c("mean abs error", "RMSE")
  
}
## [1] 1
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  8.7939 
## tolerance is fixed at  0.0088 
## ntrees resid. dev. 
## 50    8.8165 
## now adding trees... 
## 100   8.7886 
## 150   8.7567 
## 200   8.7297 
## 250   8.7006 
## 300   8.6819 
## 350   8.6623 
## 400   8.648 
## 450   8.6332 
## 500   8.621 
## 550   8.6123 
## 600   8.6023 
## 650   8.5969 
## 700   8.5886 
## 750   8.5841 
## 800   8.5754 
## 850   8.5717 
## 900   8.5703 
## 950   8.5693 
## 1000   8.5665

## 
## mean total deviance = 8.794 
## mean residual deviance = 7.012 
##  
## estimated cv deviance = 8.567 ; se = 0.594 
##  
## training data correlation = 0.603 
## cv correlation =  0.178 ; se = 0.047 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 2
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.9353 
## tolerance is fixed at  0.0109 
## ntrees resid. dev. 
## 50    10.9176 
## now adding trees... 
## 100   10.815 
## 150   10.7127 
## 200   10.6227 
## 250   10.5419 
## 300   10.4624 
## 350   10.387 
## 400   10.319 
## 450   10.2563 
## 500   10.2015 
## 550   10.1449 
## 600   10.0895 
## 650   10.0426 
## 700   9.9966 
## 750   9.9549 
## 800   9.9163 
## 850   9.8814 
## 900   9.8498 
## 950   9.8182 
## 1000   9.7973

## 
## mean total deviance = 10.935 
## mean residual deviance = 8.089 
##  
## estimated cv deviance = 9.797 ; se = 0.873 
##  
## training data correlation = 0.642 
## cv correlation =  0.339 ; se = 0.045 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 3
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.1374 
## tolerance is fixed at  0.0101 
## ntrees resid. dev. 
## 50    10.1127 
## now adding trees... 
## 100   10.037 
## 150   9.9703 
## 200   9.9164 
## 250   9.8651 
## 300   9.8181 
## 350   9.7813 
## 400   9.7415 
## 450   9.7154 
## 500   9.6829 
## 550   9.6531 
## 600   9.6348 
## 650   9.6159 
## 700   9.5954 
## 750   9.5782 
## 800   9.5624 
## 850   9.5494 
## 900   9.5392 
## 950   9.5279 
## 1000   9.517

## 
## mean total deviance = 10.137 
## mean residual deviance = 7.69 
##  
## estimated cv deviance = 9.517 ; se = 0.758 
##  
## training data correlation = 0.625 
## cv correlation =  0.265 ; se = 0.043 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 4
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.5415 
## tolerance is fixed at  0.0105 
## ntrees resid. dev. 
## 50    10.5096 
## now adding trees... 
## 100   10.438 
## 150   10.3736 
## 200   10.3146 
## 250   10.2625 
## 300   10.2099 
## 350   10.1721 
## 400   10.1377 
## 450   10.1031 
## 500   10.0721 
## 550   10.043 
## 600   10.0145 
## 650   9.9933 
## 700   9.9719 
## 750   9.9531 
## 800   9.9349 
## 850   9.9194 
## 900   9.9037 
## 950   9.8879 
## 1000   9.8695

## 
## mean total deviance = 10.542 
## mean residual deviance = 7.93 
##  
## estimated cv deviance = 9.869 ; se = 0.529 
##  
## training data correlation = 0.647 
## cv correlation =  0.258 ; se = 0.046 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 5
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.2566 
## tolerance is fixed at  0.0103 
## ntrees resid. dev. 
## 50    10.2809 
## now adding trees... 
## 100   10.2246 
## 150   10.1745 
## 200   10.1239 
## 250   10.0746 
## 300   10.0301 
## 350   9.9894 
## 400   9.9485 
## 450   9.9196 
## 500   9.8957 
## 550   9.8639 
## 600   9.8361 
## 650   9.814 
## 700   9.7971 
## 750   9.7783 
## 800   9.762 
## 850   9.7545 
## 900   9.7386 
## 950   9.7219 
## 1000   9.713

## 
## mean total deviance = 10.257 
## mean residual deviance = 7.828 
##  
## estimated cv deviance = 9.713 ; se = 0.645 
##  
## training data correlation = 0.634 
## cv correlation =  0.238 ; se = 0.052 
##  
## elapsed time -  0.18 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 6
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.9166 
## tolerance is fixed at  0.0109 
## ntrees resid. dev. 
## 50    10.877 
## now adding trees... 
## 100   10.7766 
## 150   10.6818 
## 200   10.5994 
## 250   10.5231 
## 300   10.4488 
## 350   10.3802 
## 400   10.3169 
## 450   10.2573 
## 500   10.2042 
## 550   10.1592 
## 600   10.1237 
## 650   10.0839 
## 700   10.042 
## 750   10.0126 
## 800   9.9767 
## 850   9.9502 
## 900   9.9316 
## 950   9.9097 
## 1000   9.89

## 
## mean total deviance = 10.917 
## mean residual deviance = 8.031 
##  
## estimated cv deviance = 9.89 ; se = 0.863 
##  
## training data correlation = 0.635 
## cv correlation =  0.298 ; se = 0.061 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 7
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.9652 
## tolerance is fixed at  0.011 
## ntrees resid. dev. 
## 50    10.9315 
## now adding trees... 
## 100   10.8772 
## 150   10.8351 
## 200   10.7913 
## 250   10.747 
## 300   10.7086 
## 350   10.6744 
## 400   10.6383 
## 450   10.6228 
## 500   10.6039 
## 550   10.5884 
## 600   10.5684 
## 650   10.5566 
## 700   10.5443 
## 750   10.5322 
## 800   10.5242 
## 850   10.5121 
## 900   10.5099 
## 950   10.5045 
## 1000   10.5029

## 
## mean total deviance = 10.965 
## mean residual deviance = 8.435 
##  
## estimated cv deviance = 10.503 ; se = 1.021 
##  
## training data correlation = 0.631 
## cv correlation =  0.211 ; se = 0.052 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 8
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.7438 
## tolerance is fixed at  0.0107 
## ntrees resid. dev. 
## 50    10.6394 
## now adding trees... 
## 100   10.5352 
## 150   10.4387 
## 200   10.3513 
## 250   10.2712 
## 300   10.2035 
## 350   10.14 
## 400   10.0871 
## 450   10.0398 
## 500   9.9957 
## 550   9.9518 
## 600   9.9172 
## 650   9.8844 
## 700   9.8561 
## 750   9.8327 
## 800   9.812 
## 850   9.7917 
## 900   9.7734 
## 950   9.7583 
## 1000   9.7459

## 
## mean total deviance = 10.744 
## mean residual deviance = 8 
##  
## estimated cv deviance = 9.746 ; se = 0.684 
##  
## training data correlation = 0.622 
## cv correlation =  0.309 ; se = 0.059 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 9
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.2843 
## tolerance is fixed at  0.0103 
## ntrees resid. dev. 
## 50    10.2681 
## now adding trees... 
## 100   10.1843 
## 150   10.1075 
## 200   10.0367 
## 250   9.9759 
## 300   9.9167 
## 350   9.8614 
## 400   9.8137 
## 450   9.7705 
## 500   9.7301 
## 550   9.7031 
## 600   9.6721 
## 650   9.6434 
## 700   9.6211 
## 750   9.5998 
## 800   9.5814 
## 850   9.5662 
## 900   9.5568 
## 950   9.5392 
## 1000   9.5296

## 
## mean total deviance = 10.284 
## mean residual deviance = 7.753 
##  
## estimated cv deviance = 9.53 ; se = 0.743 
##  
## training data correlation = 0.631 
## cv correlation =  0.296 ; se = 0.07 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 10
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of gaussian 
## Using 267 observations and 18 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  10.4431 
## tolerance is fixed at  0.0104 
## ntrees resid. dev. 
## 50    10.4748 
## now adding trees... 
## 100   10.3803 
## 150   10.2992 
## 200   10.2174 
## 250   10.1437 
## 300   10.081 
## 350   10.0145 
## 400   9.959 
## 450   9.9029 
## 500   9.8558 
## 550   9.8171 
## 600   9.7766 
## 650   9.7342 
## 700   9.7024 
## 750   9.6758 
## 800   9.6466 
## 850   9.622 
## 900   9.5965 
## 950   9.5781 
## 1000   9.5551

## 
## mean total deviance = 10.443 
## mean residual deviance = 7.655 
##  
## estimated cv deviance = 9.555 ; se = 0.898 
##  
## training data correlation = 0.642 
## cv correlation =  0.314 ; se = 0.067 
##  
## elapsed time -  0.18 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees

VALIDATION OF REGRESSION MODELS FOR GRADE(G3)

Let’s see the correlation between the predicted and observed response variable.

#correlation between the predicted and observed grade(G3)
all_cor_grade
##   grade_glm grade_gam grade_gbm1 grade_gbm2
## 1 0.2734951 0.2807136  0.2540703  0.2529366

The correlations are all low, and not very different from one another.

Below, we can see the errors of the various models.

#error of all the models
all_error_grade
##                grade_glm grade_gam grade_gbm1 grade_gbm2
## mean abs error  2.624296  2.628967   2.653752   2.651859
## RMSE            3.321220  3.327466   3.372944   3.367398

There appears to be little difference in the errors across the models.
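With errors this close, the choice could also be made programmatically. A small sketch, using a toy stand-in data frame shaped like the all_error_grade table above (the numbers are copied from that table for illustration):

```r
# toy stand-in for the all_error_grade data frame built in the loop above
all_error_grade <- data.frame(
  grade_glm  = c(2.624, 3.321),
  grade_gam  = c(2.629, 3.327),
  grade_gbm1 = c(2.654, 3.373),
  grade_gbm2 = c(2.652, 3.367),
  row.names  = c("mean abs error", "RMSE")
)

# name of the model with the lowest RMSE
names(which.min(unlist(all_error_grade["RMSE", ])))
```

On these values the GLM comes out lowest, though the margin is small.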

GAM and GBM for GRADE(G3) MODELLING

I chose GBM because it can handle multicollinearity and complex interactions, and it can show response curves and the relative importance of predictors. GAM can also show response curves, together with their confidence intervals.

SUMMARY OF THE GBM FOR GRADE(G3)

summary(grade_gbm1)

##                   var    rel.inf
## higher         higher 26.0188316
## Medu             Medu 12.5100743
## alc_use       alc_use  9.2399806
## absences     absences  7.8489797
## age               age  6.8760066
## studytime   studytime  6.6032340
## traveltime traveltime  6.1522710
## goout           goout  5.7957168
## Fedu             Fedu  3.9204703
## address       address  3.4998539
## freetime     freetime  2.6551947
## famsup         famsup  1.7636931
## internet     internet  1.7039128
## sex               sex  1.6437026
## romantic     romantic  1.2318752
## famrel         famrel  1.2249596
## activities activities  0.9150867
## Pstatus       Pstatus  0.3961565

Here, we can see that higher-education ambition, mother’s education level, alcohol use, and absences seem to be the most important factors affecting students’ performance. This is quite in line with the results I got from GLM and GAM; there, however, only mother’s education level and higher-education ambition appeared significant. The least important factors are also shown in the GBM’s summary of relative importance.

RESPONSE CURVES (GAM, GBM)

Now, let’s see the response curves from GAM.

grade_gam <- gam(G3~ s(Medu, k=3) + higher + address + s(studytime, k=3) + goout,
                 data = data_reg, family = "gaussian")
plot(grade_gam, pages=1)

Here, from the smooth curve from GAM, we can see that mother’s education level has a positive effect on students’ performance. However, the confidence interval, especially at the lower end, is quite wide, which indicates a large degree of uncertainty; perhaps more observations are needed. This will be explored further in the response curves from GBM below.

best.iter1<-gbm.perf(grade_gbm1, plot.it = F, method = "OOB")

par(mfrow=c(2,2))
plot.gbm(grade_gbm1, "Medu", best.iter1)
plot.gbm(grade_gbm1, "higher", best.iter1)
plot.gbm(grade_gbm1, "alc_use", best.iter1)
plot.gbm(grade_gbm1, "absences", best.iter1)

par(mfrow=c(1,1))

As we can see here, the higher the mother’s level of education, the better her children seem to perform. Also, students intending to pursue higher education seem to perform better. Alcohol use and absences reduce performance, and can lead to a dramatic reduction if they become chronic.

PREDICTED VS OBSERVED GRADE(G3) FROM GBM.

plot(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3, 
     main="Observed vs Predicted grade")
lines(lowess(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3), col="red", lwd=3)
r_grade <-cor.test(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3)
r2grade <- r_grade$estimate^2
r2grade
##       cor 
## 0.3260634
legend("topleft", paste("r^2=", round(r2grade,3)))

We can see from the scatterplot that the selected variables explain only about 33% of the variation in students’ grades (r² ≈ 0.326).

MODELLING ABSENCES (GLM, GAM, GBM)

MODEL EVALUATION AND VALIDATION.

For the GLM, predictor selection was done as earlier, using stepwise regression, the ANOVA (chi-square) test, and the significance tests from the model summary.
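The same procedure can be sketched on simulated count data; this is a self-contained illustration, not the actual absences model (the predictors x1–x3 are made up, with x3 pure noise):

```r
library(MASS)

set.seed(42)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.5 + 0.8 * x1 - 0.4 * x2))  # x3 does not affect y

# full Poisson GLM, then stepwise selection by AIC
full_pois <- glm(y ~ x1 + x2 + x3, family = "poisson")
step_pois <- stepAIC(full_pois, direction = "both", trace = FALSE)

# significance tests used alongside the stepwise search
anova(step_pois, test = "Chisq")
summary(step_pois)
```

The real model below follows the same pattern, with the student predictors in place of x1–x3.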

## 
## Call:
## glm(formula = absences ~ sex + age + address + Medu + Fedu + 
##     Pstatus + traveltime + studytime + famsup + activities + 
##     higher + internet + famrel + romantic + freetime + goout + 
##     high_use + G3, family = "poisson", data = data_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4872  -1.7699  -0.6181   0.7933   8.5118  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.74725    0.43714   1.709  0.08737 .  
## sexM          -0.33113    0.05515  -6.004 1.92e-09 ***
## age            0.09829    0.02175   4.520 6.17e-06 ***
## addressU      -0.13937    0.06329  -2.202  0.02767 *  
## Medu           0.12363    0.03159   3.914 9.07e-05 ***
## Fedu          -0.04894    0.02963  -1.652  0.09860 .  
## PstatusT      -0.43752    0.07029  -6.224 4.85e-10 ***
## traveltime    -0.07698    0.03844  -2.003  0.04521 *  
## studytime     -0.19572    0.03515  -5.568 2.58e-08 ***
## famsupyes      0.06916    0.05329   1.298  0.19433    
## activitiesyes  0.13248    0.05151   2.572  0.01011 *  
## higheryes     -0.17379    0.11194  -1.553  0.12052    
## internetyes    0.35447    0.08038   4.410 1.03e-05 ***
## famrel        -0.02684    0.02678  -1.002  0.31631    
## romanticyes    0.15072    0.05288   2.850  0.00437 ** 
## freetime      -0.08304    0.02707  -3.068  0.00216 ** 
## goout          0.04036    0.02434   1.658  0.09727 .  
## high_useTRUE   0.49462    0.05598   8.836  < 2e-16 ***
## G3            -0.01864    0.00819  -2.276  0.02287 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1917.2  on 381  degrees of freedom
## Residual deviance: 1582.2  on 363  degrees of freedom
## AIC: 2639.8
## 
## Number of Fisher Scoring iterations: 5
## Start:  AIC=2639.77
## absences ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + famsup + activities + higher + internet + famrel + 
##     romantic + freetime + goout + high_use + G3
## 
##              Df Deviance    AIC
## - famrel      1   1583.2 2638.8
## - famsup      1   1583.8 2639.5
## <none>            1582.2 2639.8
## - higher      1   1584.5 2640.1
## - Fedu        1   1584.9 2640.5
## - goout       1   1584.9 2640.5
## - traveltime  1   1586.2 2641.9
## - address     1   1586.9 2642.5
## - G3          1   1587.3 2642.9
## - activities  1   1588.8 2644.4
## - romantic    1   1590.2 2645.8
## - freetime    1   1591.5 2647.2
## - Medu        1   1597.6 2653.2
## - age         1   1602.4 2658.0
## - internet    1   1603.1 2658.7
## - studytime   1   1614.3 2669.9
## - Pstatus     1   1617.6 2673.2
## - sex         1   1618.5 2674.1
## - high_use    1   1658.0 2713.6
## 
## Step:  AIC=2638.77
## absences ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + famsup + activities + higher + internet + romantic + 
##     freetime + goout + high_use + G3
## 
##              Df Deviance    AIC
## - famsup      1   1584.9 2638.5
## <none>            1583.2 2638.8
## - higher      1   1585.5 2639.2
## - Fedu        1   1585.8 2639.4
## - goout       1   1585.9 2639.5
## - traveltime  1   1587.2 2640.8
## - address     1   1587.8 2641.5
## - G3          1   1588.5 2642.1
## - activities  1   1589.5 2643.2
## - romantic    1   1591.4 2645.0
## - freetime    1   1593.5 2647.1
## - Medu        1   1598.7 2652.3
## - age         1   1602.8 2656.4
## - internet    1   1603.8 2657.5
## - studytime   1   1615.2 2668.8
## - Pstatus     1   1618.7 2672.3
## - sex         1   1619.9 2673.5
## - high_use    1   1663.2 2716.8
## 
## Step:  AIC=2638.49
## absences ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + activities + higher + internet + romantic + freetime + 
##     goout + high_use + G3
## 
##              Df Deviance    AIC
## <none>            1584.9 2638.5
## - higher      1   1587.0 2638.6
## - Fedu        1   1587.1 2638.7
## - goout       1   1587.6 2639.2
## - traveltime  1   1588.6 2640.2
## - address     1   1589.9 2641.5
## - G3          1   1590.7 2642.3
## - activities  1   1590.8 2642.4
## - romantic    1   1593.2 2644.8
## - freetime    1   1595.1 2646.7
## - Medu        1   1601.0 2652.7
## - age         1   1603.7 2655.3
## - internet    1   1606.5 2658.1
## - studytime   1   1615.5 2667.1
## - Pstatus     1   1619.8 2671.4
## - sex         1   1623.8 2675.4
## - high_use    1   1664.6 2716.3
## 
## Call:  glm(formula = absences ~ sex + age + address + Medu + Fedu + 
##     Pstatus + traveltime + studytime + activities + higher + 
##     internet + romantic + freetime + goout + high_use + G3, family = "poisson", 
##     data = data_reg)
## 
## Coefficients:
##   (Intercept)           sexM            age       addressU           Medu  
##       0.73842       -0.34066        0.09383       -0.14284        0.12595  
##          Fedu       PstatusT     traveltime      studytime  activitiesyes  
##      -0.04394       -0.43378       -0.07353       -0.18947        0.12497  
##     higheryes    internetyes    romanticyes       freetime          goout  
##      -0.16371        0.35877        0.15299       -0.08587        0.03984  
##  high_useTRUE             G3  
##       0.50207       -0.01980  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  365 Residual
## Null Deviance:       1917 
## Residual Deviance: 1585  AIC: 2638
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: absences
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                         381     1917.2              
## sex         1    6.559       380     1910.7 0.0104342 *  
## age         1   39.964       379     1870.7 2.587e-10 ***
## address     1    0.844       378     1869.8 0.3582118    
## Medu        1   30.346       377     1839.5 3.615e-08 ***
## Fedu        1    1.355       376     1838.1 0.2444438    
## Pstatus     1   37.928       375     1800.2 7.340e-10 ***
## traveltime  1    0.019       374     1800.2 0.8917349    
## studytime   1   52.787       373     1747.4 3.717e-13 ***
## famsup      1    2.517       372     1744.9 0.1126040    
## activities  1    3.515       371     1741.4 0.0608124 .  
## higher      1   11.233       370     1730.1 0.0008034 ***
## internet    1   27.982       369     1702.2 1.224e-07 ***
## famrel      1    7.405       368     1694.8 0.0065034 ** 
## romantic    1    3.782       367     1691.0 0.0518000 .  
## freetime    1    1.799       366     1689.2 0.1798908    
## goout       1   24.357       365     1664.8 8.003e-07 ***
## high_use    1   77.516       364     1587.3 < 2.2e-16 ***
## G3          1    5.148       363     1582.2 0.0232754 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Final model

abse_glm<-glm(absences ~ sex + age + Medu + 
                Pstatus +  studytime +  higher + 
                internet + goout + alc_use + G3
              ,data=data_reg,family ="poisson")


summary(abse_glm)
## 
## Call:
## glm(formula = absences ~ sex + age + Medu + Pstatus + studytime + 
##     higher + internet + goout + alc_use + G3, family = "poisson", 
##     data = data_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1134  -1.8059  -0.6326   0.8032   9.8400  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.170583   0.410599   0.415  0.67781    
## sexM        -0.406371   0.054637  -7.438 1.02e-13 ***
## age          0.095102   0.020967   4.536 5.74e-06 ***
## Medu         0.116579   0.024475   4.763 1.91e-06 ***
## PstatusT    -0.458253   0.069217  -6.621 3.58e-11 ***
## studytime   -0.156525   0.033700  -4.645 3.41e-06 ***
## higheryes   -0.220171   0.108890  -2.022  0.04318 *  
## internetyes  0.349942   0.078836   4.439 9.04e-06 ***
## goout        0.007283   0.023303   0.313  0.75463    
## alc_use      0.220332   0.026020   8.468  < 2e-16 ***
## G3          -0.022122   0.008040  -2.751  0.00593 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1917.2  on 381  degrees of freedom
## Residual deviance: 1622.0  on 371  degrees of freedom
## AIC: 2663.6
## 
## Number of Fisher Scoring iterations: 5
anova(abse_glm, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: absences
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        381     1917.2              
## sex        1    6.559       380     1910.7  0.010434 *  
## age        1   39.964       379     1870.7 2.587e-10 ***
## Medu       1   28.611       378     1842.1 8.848e-08 ***
## Pstatus    1   37.104       377     1805.0 1.120e-09 ***
## studytime  1   50.651       376     1754.3 1.103e-12 ***
## higher     1   10.643       375     1743.7  0.001105 ** 
## internet   1   26.691       374     1717.0 2.387e-07 ***
## goout      1   15.196       373     1701.8 9.692e-05 ***
## alc_use    1   72.308       372     1629.5 < 2.2e-16 ***
## G3         1    7.512       371     1622.0  0.006130 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of deviance table above highlights the factors that may be driving student absences.
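The backward-elimination output shown earlier can be reproduced with `MASS::stepAIC`. A minimal sketch on simulated data (the variables `x1`, `x2` and `noise` are illustrative, not the student data):

```r
library(MASS)  # provides stepAIC

set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
# absence-like counts that depend only on x1
sim$y <- rpois(n, lambda = exp(0.5 + 0.4 * sim$x1))

full <- glm(y ~ x1 + x2 + noise, data = sim, family = "poisson")
# backward elimination: repeatedly drop the term whose removal lowers AIC most
step_fit <- stepAIC(full, direction = "backward", trace = FALSE)
formula(step_fit)
```

The `trace = FALSE` argument suppresses the step-by-step tables; with the default `trace = 1` you get exactly the kind of output printed above.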

TRAINING AND TESTING OF MODELS FOR “ABSENCES”

## [1] 1
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  4.9355 
## tolerance is fixed at  0.0049 
## ntrees resid. dev. 
## 50    4.9413 
## now adding trees... 
## 100   4.9032 
## 150   4.873 
## 200   4.8452 
## 250   4.8221 
## 300   4.802 
## 350   4.783 
## 400   4.7666 
## 450   4.7533 
## 500   4.7401 
## 550   4.7287 
## 600   4.7177 
## 650   4.7097 
## 700   4.7022 
## 750   4.6947 
## 800   4.6891 
## 850   4.6851 
## 900   4.68 
## 950   4.6774 
## 1000   4.6734

## 
## mean total deviance = 4.935 
## mean residual deviance = 3.824 
##  
## estimated cv deviance = 4.673 ; se = 0.511 
##  
## training data correlation = 0.601 
## cv correlation =  0.224 ; se = 0.06 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 2
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  5.0147 
## tolerance is fixed at  0.005 
## ntrees resid. dev. 
## 50    4.9987 
## now adding trees... 
## 100   4.9687 
## 150   4.945 
## 200   4.9232 
## 250   4.9021 
## 300   4.8854 
## 350   4.8693 
## 400   4.8555 
## 450   4.8427 
## 500   4.8333 
## 550   4.8247 
## 600   4.8165 
## 650   4.8098 
## 700   4.8033 
## 750   4.7988 
## 800   4.7954 
## 850   4.7938 
## 900   4.7959 
## 950   4.7934 
## 1000   4.7925

## 
## mean total deviance = 5.015 
## mean residual deviance = 3.952 
##  
## estimated cv deviance = 4.793 ; se = 0.509 
##  
## training data correlation = 0.582 
## cv correlation =  0.222 ; se = 0.062 
##  
## elapsed time -  0.18 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 3
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  4.3837 
## tolerance is fixed at  0.0044 
## ntrees resid. dev. 
## 50    4.4015 
## now adding trees... 
## 100   4.3906 
## 150   4.3814 
## 200   4.3722 
## 250   4.3653 
## 300   4.3581 
## 350   4.3563 
## 400   4.3525 
## 450   4.3515 
## 500   4.3513 
## 550   4.3513 
## 600   4.3508 
## 650   4.3537 
## 700   4.3533 
## 750   4.3564 
## 800   4.3595 
## 850   4.3637 
## 900   4.3661 
## 950   4.3719 
## 1000   4.3769

## 
## mean total deviance = 4.384 
## mean residual deviance = 3.78 
##  
## estimated cv deviance = 4.351 ; se = 0.381 
##  
## training data correlation = 0.544 
## cv correlation =  0.127 ; se = 0.059 
##  
## elapsed time -  0.18 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 4
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  4.7289 
## tolerance is fixed at  0.0047 
## ntrees resid. dev. 
## 50    4.8103 
## now adding trees... 
## 100   4.791 
## 150   4.7746 
## 200   4.761 
## 250   4.7493 
## 300   4.7397 
## 350   4.7286 
## 400   4.7196 
## 450   4.7128 
## 500   4.7076 
## 550   4.7048 
## 600   4.7041 
## 650   4.7038 
## 700   4.7041 
## 750   4.7056 
## 800   4.7081 
## 850   4.7104 
## 900   4.7131 
## 950   4.7153 
## 1000   4.7183

## 
## mean total deviance = 4.729 
## mean residual deviance = 3.999 
##  
## estimated cv deviance = 4.704 ; se = 0.629 
##  
## training data correlation = 0.541 
## cv correlation =  0.184 ; se = 0.065 
##  
## elapsed time -  0.18 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 5
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  5.1873 
## tolerance is fixed at  0.0052 
## ntrees resid. dev. 
## 50    5.234 
## now adding trees... 
## 100   5.2184 
## 150   5.2026 
## 200   5.1905 
## 250   5.1767 
## 300   5.1667 
## 350   5.1604 
## 400   5.1525 
## 450   5.1452 
## 500   5.1406 
## 550   5.1342 
## 600   5.134 
## 650   5.1331 
## 700   5.1318 
## 750   5.1311 
## 800   5.1298 
## 850   5.1297 
## 900   5.1296 
## 950   5.1304 
## 1000   5.1322

## 
## mean total deviance = 5.187 
## mean residual deviance = 4.17 
##  
## estimated cv deviance = 5.13 ; se = 0.701 
##  
## training data correlation = 0.565 
## cv correlation =  0.159 ; se = 0.046 
##  
## elapsed time -  0.18 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 6
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  4.7934 
## tolerance is fixed at  0.0048 
## ntrees resid. dev. 
## 50    4.8443 
## now adding trees... 
## 100   4.8208 
## 150   4.7992 
## 200   4.7822 
## 250   4.7659 
## 300   4.7547 
## 350   4.7442 
## 400   4.7319 
## 450   4.7236 
## 500   4.7189 
## 550   4.7173 
## 600   4.7146 
## 650   4.7129 
## 700   4.7136 
## 750   4.7139 
## 800   4.7129 
## 850   4.7143 
## 900   4.7162 
## 950   4.7177 
## 1000   4.7179

## 
## mean total deviance = 4.793 
## mean residual deviance = 3.956 
##  
## estimated cv deviance = 4.713 ; se = 1.161 
##  
## training data correlation = 0.546 
## cv correlation =  0.214 ; se = 0.05 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 7
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  5.7434 
## tolerance is fixed at  0.0057 
## ntrees resid. dev. 
## 50    5.7636 
## now adding trees... 
## 100   5.7322 
## 150   5.7054 
## 200   5.6804 
## 250   5.6583 
## 300   5.6374 
## 350   5.6188 
## 400   5.6024 
## 450   5.5879 
## 500   5.5754 
## 550   5.5636 
## 600   5.5519 
## 650   5.5401 
## 700   5.5291 
## 750   5.517 
## 800   5.5078 
## 850   5.4974 
## 900   5.4903 
## 950   5.4844 
## 1000   5.4767

## 
## mean total deviance = 5.743 
## mean residual deviance = 4.447 
##  
## estimated cv deviance = 5.477 ; se = 0.931 
##  
## training data correlation = 0.617 
## cv correlation =  0.269 ; se = 0.052 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 8
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  5.3767 
## tolerance is fixed at  0.0054 
## ntrees resid. dev. 
## 50    5.3726 
## now adding trees... 
## 100   5.3469 
## 150   5.3292 
## 200   5.3137 
## 250   5.2983 
## 300   5.2833 
## 350   5.2716 
## 400   5.2644 
## 450   5.2578 
## 500   5.2513 
## 550   5.2444 
## 600   5.2397 
## 650   5.2338 
## 700   5.2288 
## 750   5.2269 
## 800   5.2268 
## 850   5.2224 
## 900   5.22 
## 950   5.2161 
## 1000   5.215

## 
## mean total deviance = 5.377 
## mean residual deviance = 4.318 
##  
## estimated cv deviance = 5.215 ; se = 0.803 
##  
## training data correlation = 0.569 
## cv correlation =  0.226 ; se = 0.058 
##  
## elapsed time -  0.18 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 9
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  5.1572 
## tolerance is fixed at  0.0052 
## ntrees resid. dev. 
## 50    5.203 
## now adding trees... 
## 100   5.168 
## 150   5.1386 
## 200   5.1103 
## 250   5.0847 
## 300   5.0616 
## 350   5.0431 
## 400   5.0253 
## 450   5.0099 
## 500   4.9984 
## 550   4.9861 
## 600   4.9791 
## 650   4.9706 
## 700   4.9634 
## 750   4.9575 
## 800   4.9531 
## 850   4.9525 
## 900   4.9505 
## 950   4.9495 
## 1000   4.9472

## 
## mean total deviance = 5.157 
## mean residual deviance = 4.029 
##  
## estimated cv deviance = 4.947 ; se = 0.676 
##  
## training data correlation = 0.591 
## cv correlation =  0.185 ; se = 0.062 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 10
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of poisson 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  4.5998 
## tolerance is fixed at  0.0046 
## ntrees resid. dev. 
## 50    4.6305 
## now adding trees... 
## 100   4.6161 
## 150   4.6044 
## 200   4.5934 
## 250   4.5815 
## 300   4.5711 
## 350   4.5646 
## 400   4.5589 
## 450   4.5523 
## 500   4.5473 
## 550   4.5426 
## 600   4.538 
## 650   4.5342 
## 700   4.5311 
## 750   4.5284 
## 800   4.5256 
## 850   4.5266 
## 900   4.5264 
## 950   4.5258 
## 1000   4.5246

## 
## mean total deviance = 4.6 
## mean residual deviance = 3.662 
##  
## estimated cv deviance = 4.525 ; se = 0.464 
##  
## training data correlation = 0.597 
## cv correlation =  0.156 ; se = 0.058 
##  
## elapsed time -  0.19 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees

Correlation between the predicted and observed absences

#correlation between the predicted and observed for all the models used.
all_cor_abse
##    abse_glm  abse_gam abse_gbm1 abse_gbm2
## 1 0.2246732 0.2231245 0.1993083 0.1881407

ERRORS OF THE MODELS IN PREDICTING “ABSENCES”

#let's see the errors in the prediction for "absence" response variable.
all_error_abse
##                abse_glm abse_gam abse_gbm1 abse_gbm2
## mean abs error 4.001197 3.968672  4.023673  3.951446
## RMSE           5.942599 5.966504  6.097966  6.048195

FITTING THE REGRESSION FOR “ABSENCES” WITH GBM

##                   var    rel.inf
## alc_use       alc_use 23.2721674
## age               age 14.0840830
## Medu             Medu 10.2271230
## Pstatus       Pstatus  8.4032585
## romantic     romantic  8.2467702
## freetime     freetime  7.7315031
## goout           goout  4.9877363
## sex               sex  4.5833422
## Fedu             Fedu  4.3660366
## studytime   studytime  4.1481797
## famrel         famrel  3.2338396
## activities activities  2.1010025
## famsup         famsup  1.4164631
## address       address  1.1068829
## higher         higher  0.9220471
## internet     internet  0.8971392
## traveltime traveltime  0.2724257

I chose GBM because it handles multicollinearity and complex interactions well, and because it can show response curves and the relative importance of the predictors.

Alcohol use, age, mother’s education, parents’ cohabitation status, free time and being in a romantic relationship ostensibly have the strongest influence on absences, while travel time and address seem to matter least.

RESPONSE CURVES OF “ABSENCES”

GAM

#now, let's see the response curves of this
#use all the dataset
abse_gam <- gam(absences ~ sex + s(age, k=3) + s(Medu, k=3) +
                  Pstatus + s(studytime, k=3) + higher +
                  internet + goout + s(alc_use, k=3) + s(G3, k=3),
                data = alco_data, family = "poisson")
plot(abse_gam, pages=1)

Teenage students are more likely to be absent, although there seems to be too little data for students above 20, as the confidence intervals become very wide there. The likelihood of absence also decreases as the mother’s level of education increases, and study time reduces it as well, while alcohol use increases it. How the final grade affects absences remains unclear.

Furthermore, I will explore this further with the GBM response curves.

RESPONSE CURVES (GBM) FOR “ABSENCES”

best.iter1<-gbm.perf(abse_gbm1, plot.it = F, method = "OOB")


par(mfrow=c(2,3))
plot.gbm(abse_gbm1, "alc_use", best.iter1)
plot.gbm(abse_gbm1, "age", best.iter1)
plot.gbm(abse_gbm1, "Medu", best.iter1)
plot.gbm(abse_gbm1, "Pstatus", best.iter1)
plot.gbm(abse_gbm1, "freetime", best.iter1)
plot.gbm(abse_gbm1, "romantic", best.iter1)

par(mfrow=c(1,1))

Alcohol use, age and mother’s education seem to increase the tendency to be absent, while free time does the opposite; this is quite surprising, as I had expected the exact reverse. Also, students whose parents live apart are more prone to absence than those whose parents live together. Likewise, students in a romantic relationship are more likely to be absent than those who are not.

PREDICTED VS OBSERVED ABSENCES (COEFFICIENT OF DETERMINATION)

plot(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences, 
     main="Observed vs Predicted absences")
lines(lowess(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences), col="red", lwd=3)
r_abse <-cor.test(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences)
r2abse <- r_abse$estimate^2
r2abse
##       cor 
## 0.2496773
legend("topleft", paste("r^2=", round(r2abse,3)))

We can see from the scatterplot that the selected variables account for only about 25% of the variation in students’ absences (the coefficient of determination, r² ≈ 0.25).

MODELLING HIGH ALCOHOL CONSUMPTION.

A similar procedure was repeated here. However, in evaluating the models I used the Area Under the Curve (AUC), odds ratios and a confusion/classification matrix instead.
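As a reminder of how these metrics work, here is a base-R sketch on toy data: a 2×2 confusion matrix at a 0.5 probability cutoff, and AUC computed via the Mann–Whitney statistic (the probability that a randomly chosen positive case is ranked above a randomly chosen negative one). All numbers below are made up.

```r
obs  <- c(1, 0, 1, 1, 0, 0, 1, 0)                   # observed high_use
prob <- c(0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1)   # predicted probabilities

# confusion matrix at a 0.5 cutoff
conf <- table(observed = obs, predicted = as.integer(prob > 0.5))
conf

# rank-based AUC: share of positive/negative pairs ranked correctly,
# counting ties as half-correct
pos <- prob[obs == 1]; neg <- prob[obs == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why the cross-validated AUC scores reported below are the key numbers to watch.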

## 
## Call:
## glm(formula = high_use ~ sex + age + address + Medu + Fedu + 
##     Pstatus + traveltime + studytime + famsup + activities + 
##     higher + internet + famrel + romantic + freetime + goout + 
##     absences, family = "binomial", data = data_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9022  -0.7522  -0.4473   0.6526   2.6517  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.759454   2.385066  -1.576 0.114969    
## sexM           0.860723   0.291187   2.956 0.003117 ** 
## age            0.080169   0.120926   0.663 0.507357    
## addressU      -0.601436   0.350235  -1.717 0.085936 .  
## Medu          -0.091134   0.169343  -0.538 0.590464    
## Fedu           0.150634   0.164444   0.916 0.359656    
## PstatusT       0.026078   0.456176   0.057 0.954413    
## traveltime     0.281860   0.201586   1.398 0.162050    
## studytime     -0.331499   0.180785  -1.834 0.066704 .  
## famsupyes      0.005161   0.282845   0.018 0.985442    
## activitiesyes -0.477285   0.276610  -1.725 0.084441 .  
## higheryes      0.146713   0.595949   0.246 0.805539    
## internetyes    0.190045   0.393884   0.482 0.629458    
## famrel        -0.419249   0.148466  -2.824 0.004745 ** 
## romanticyes   -0.344881   0.297108  -1.161 0.245725    
## freetime       0.154201   0.146426   1.053 0.292295    
## goout          0.751213   0.133447   5.629 1.81e-08 ***
## absences       0.079062   0.022936   3.447 0.000567 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 360.78  on 364  degrees of freedom
## AIC: 396.78
## 
## Number of Fisher Scoring iterations: 5
## Start:  AIC=396.78
## high_use ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + famsup + activities + higher + internet + famrel + 
##     romantic + freetime + goout + absences
## 
##              Df Deviance    AIC
## - famsup      1   360.78 394.78
## - Pstatus     1   360.78 394.78
## - higher      1   360.84 394.84
## - internet    1   361.01 395.01
## - Medu        1   361.07 395.07
## - age         1   361.22 395.22
## - Fedu        1   361.62 395.62
## - freetime    1   361.89 395.89
## - romantic    1   362.15 396.15
## - traveltime  1   362.73 396.73
## <none>            360.78 396.78
## - address     1   363.69 397.69
## - activities  1   363.79 397.79
## - studytime   1   364.29 398.29
## - famrel      1   368.85 402.85
## - sex         1   369.73 403.73
## - absences    1   372.58 406.58
## - goout       1   397.21 431.21
## 
## Step:  AIC=394.78
## high_use ~ sex + age + address + Medu + Fedu + Pstatus + traveltime + 
##     studytime + activities + higher + internet + famrel + romantic + 
##     freetime + goout + absences
## 
##              Df Deviance    AIC
## - Pstatus     1   360.78 392.78
## - higher      1   360.84 392.84
## - internet    1   361.02 393.02
## - Medu        1   361.07 393.07
## - age         1   361.22 393.22
## - Fedu        1   361.64 393.64
## - freetime    1   361.90 393.90
## - romantic    1   362.15 394.15
## - traveltime  1   362.73 394.73
## <none>            360.78 394.78
## - address     1   363.70 395.70
## - activities  1   363.79 395.79
## - studytime   1   364.33 396.33
## + famsup      1   360.78 396.78
## - famrel      1   368.86 400.86
## - sex         1   369.83 401.83
## - absences    1   372.59 404.59
## - goout       1   397.21 429.21
## 
## Step:  AIC=392.78
## high_use ~ sex + age + address + Medu + Fedu + traveltime + studytime + 
##     activities + higher + internet + famrel + romantic + freetime + 
##     goout + absences
## 
##              Df Deviance    AIC
## - higher      1   360.84 390.84
## - internet    1   361.03 391.03
## - Medu        1   361.08 391.08
## - age         1   361.23 391.23
## - Fedu        1   361.64 391.64
## - freetime    1   361.91 391.91
## - romantic    1   362.16 392.16
## - traveltime  1   362.73 392.73
## <none>            360.78 392.78
## - address     1   363.71 393.71
## - activities  1   363.82 393.82
## - studytime   1   364.34 394.34
## + Pstatus     1   360.78 394.78
## + famsup      1   360.78 394.78
## - famrel      1   368.86 398.86
## - sex         1   369.84 399.84
## - absences    1   372.70 402.70
## - goout       1   397.23 427.23
## 
## Step:  AIC=390.84
## high_use ~ sex + age + address + Medu + Fedu + traveltime + studytime + 
##     activities + internet + famrel + romantic + freetime + goout + 
##     absences
## 
##              Df Deviance    AIC
## - internet    1   361.07 389.07
## - Medu        1   361.13 389.13
## - age         1   361.24 389.24
## - Fedu        1   361.74 389.74
## - freetime    1   361.99 389.99
## - romantic    1   362.38 390.38
## - traveltime  1   362.79 390.79
## <none>            360.84 390.84
## - address     1   363.74 391.74
## - activities  1   363.82 391.82
## - studytime   1   364.34 392.34
## + higher      1   360.78 392.78
## + Pstatus     1   360.84 392.84
## + famsup      1   360.84 392.84
## - famrel      1   368.90 396.90
## - sex         1   369.92 397.92
## - absences    1   372.71 400.71
## - goout       1   397.25 425.25
## 
## Step:  AIC=389.07
## high_use ~ sex + age + address + Medu + Fedu + traveltime + studytime + 
##     activities + famrel + romantic + freetime + goout + absences
## 
##              Df Deviance    AIC
## - Medu        1   361.31 387.31
## - age         1   361.47 387.47
## - Fedu        1   361.98 387.98
## - freetime    1   362.31 388.31
## - romantic    1   362.55 388.55
## - traveltime  1   363.01 389.01
## <none>            361.07 389.07
## - address     1   363.76 389.76
## - activities  1   363.96 389.96
## - studytime   1   364.51 390.51
## + internet    1   360.84 390.84
## + higher      1   361.03 391.03
## + Pstatus     1   361.06 391.06
## + famsup      1   361.07 391.07
## - famrel      1   369.07 395.07
## - sex         1   370.27 396.27
## - absences    1   373.49 399.49
## - goout       1   397.68 423.68
## 
## Step:  AIC=387.31
## high_use ~ sex + age + address + Fedu + traveltime + studytime + 
##     activities + famrel + romantic + freetime + goout + absences
## 
##              Df Deviance    AIC
## - age         1   361.74 385.74
## - Fedu        1   362.01 386.01
## - freetime    1   362.51 386.51
## - romantic    1   362.81 386.81
## <none>            361.31 387.31
## - traveltime  1   363.38 387.38
## - address     1   364.12 388.12
## - activities  1   364.28 388.28
## - studytime   1   364.90 388.90
## + Medu        1   361.07 389.07
## + internet    1   361.13 389.13
## + higher      1   361.28 389.28
## + Pstatus     1   361.29 389.29
## + famsup      1   361.31 389.31
## - famrel      1   369.32 393.32
## - sex         1   370.32 394.32
## - absences    1   373.51 397.51
## - goout       1   397.77 421.77
## 
## Step:  AIC=385.74
## high_use ~ sex + address + Fedu + traveltime + studytime + activities + 
##     famrel + romantic + freetime + goout + absences
## 
##              Df Deviance    AIC
## - Fedu        1   362.33 384.33
## - freetime    1   362.85 384.85
## - romantic    1   363.04 385.04
## <none>            361.74 385.74
## - traveltime  1   363.88 385.88
## - activities  1   364.82 386.82
## - address     1   364.95 386.95
## + age         1   361.31 387.31
## - studytime   1   365.32 387.32
## + Medu        1   361.47 387.47
## + internet    1   361.55 387.55
## + Pstatus     1   361.70 387.70
## + higher      1   361.73 387.73
## + famsup      1   361.74 387.74
## - famrel      1   369.56 391.56
## - sex         1   370.89 392.89
## - absences    1   374.53 396.53
## - goout       1   400.71 422.71
## 
## Step:  AIC=384.33
## high_use ~ sex + address + traveltime + studytime + activities + 
##     famrel + romantic + freetime + goout + absences
## 
##              Df Deviance    AIC
## - freetime    1   363.35 383.35
## - romantic    1   363.62 383.62
## - traveltime  1   364.17 384.17
## <none>            362.33 384.33
## - activities  1   365.16 385.16
## - address     1   365.44 385.44
## + Fedu        1   361.74 385.74
## - studytime   1   365.86 385.86
## + age         1   362.01 386.01
## + internet    1   362.09 386.09
## + higher      1   362.29 386.29
## + famsup      1   362.32 386.32
## + Medu        1   362.32 386.32
## + Pstatus     1   362.32 386.32
## - famrel      1   370.21 390.21
## - sex         1   371.74 391.74
## - absences    1   375.32 395.32
## - goout       1   401.89 421.89
## 
## Step:  AIC=383.35
## high_use ~ sex + address + traveltime + studytime + activities + 
##     famrel + romantic + goout + absences
## 
##              Df Deviance    AIC
## - romantic    1   364.45 382.45
## - traveltime  1   365.07 383.07
## <none>            363.35 383.35
## - activities  1   366.01 384.01
## + freetime    1   362.33 384.33
## - address     1   366.39 384.39
## + Fedu        1   362.85 384.85
## + internet    1   363.02 385.02
## + age         1   363.09 385.09
## - studytime   1   367.13 385.13
## + higher      1   363.30 385.30
## + famsup      1   363.32 385.32
## + Pstatus     1   363.32 385.32
## + Medu        1   363.34 385.34
## - famrel      1   370.64 388.64
## - sex         1   373.74 391.74
## - absences    1   375.96 393.96
## - goout       1   409.84 427.84
## 
## Step:  AIC=382.45
## high_use ~ sex + address + traveltime + studytime + activities + 
##     famrel + goout + absences
## 
##              Df Deviance    AIC
## - traveltime  1   366.16 382.16
## <none>            364.45 382.45
## - activities  1   367.21 383.21
## + romantic    1   363.35 383.35
## - address     1   367.42 383.42
## + freetime    1   363.62 383.62
## + Fedu        1   363.95 383.95
## + internet    1   364.19 384.19
## + higher      1   364.27 384.27
## + age         1   364.31 384.31
## - studytime   1   368.37 384.37
## + Pstatus     1   364.41 384.41
## + famsup      1   364.43 384.43
## + Medu        1   364.44 384.44
## - famrel      1   371.46 387.46
## - sex         1   375.05 391.05
## - absences    1   376.50 392.50
## - goout       1   410.69 426.69
## 
## Step:  AIC=382.16
## high_use ~ sex + address + studytime + activities + famrel + 
##     goout + absences
## 
##              Df Deviance    AIC
## <none>            366.16 382.16
## + traveltime  1   364.45 382.45
## - activities  1   368.97 382.97
## + romantic    1   365.07 383.07
## + freetime    1   365.45 383.45
## + Fedu        1   365.92 383.92
## + internet    1   365.95 383.95
## + age         1   365.95 383.95
## + higher      1   366.02 384.02
## + Pstatus     1   366.13 384.13
## + Medu        1   366.14 384.14
## + famsup      1   366.15 384.15
## - studytime   1   370.67 384.67
## - address     1   371.58 385.58
## - famrel      1   373.02 387.02
## - sex         1   376.96 390.96
## - absences    1   377.98 391.98
## - goout       1   413.90 427.90
## 
## Call:  glm(formula = high_use ~ sex + address + studytime + activities + 
##     famrel + goout + absences, family = "binomial", data = data_reg)
## 
## Coefficients:
##   (Intercept)           sexM       addressU      studytime  activitiesyes  
##      -1.35960        0.90337       -0.73174       -0.36291       -0.44562  
##        famrel          goout       absences  
##      -0.37614        0.80515        0.07543  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  374 Residual
## Null Deviance:       465.7 
## Residual Deviance: 366.2     AIC: 382.2

Final model (GLM) for high alcohol use

halc_glm<-glm(high_use ~sex + studytime + goout + absences
              ,data=data_reg,family ="binomial")

anova(halc_glm, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: high_use
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        381     465.68              
## sex        1   14.732       380     450.95 0.0001239 ***
## studytime  1   10.242       379     440.71 0.0013731 ** 
## goout      1   46.759       378     393.95 8.027e-12 ***
## absences   1   12.641       377     381.31 0.0003775 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 1
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2211 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.1994 
## now adding trees... 
## 100   1.1801 
## 150   1.1631 
## 200   1.1477 
## 250   1.134 
## 300   1.1209 
## 350   1.1097 
## 400   1.0984 
## 450   1.0886 
## 500   1.0794 
## 550   1.0707 
## 600   1.063 
## 650   1.0561 
## 700   1.0494 
## 750   1.0433 
## 800   1.0376 
## 850   1.0326 
## 900   1.0279 
## 950   1.0235 
## 1000   1.0196 
## 1050   1.0156 
## 1100   1.012 
## 1150   1.0085 
## 1200   1.0055 
## 1250   1.0027 
## 1300   1.0004 
## 1350   0.9984 
## 1400   0.9966 
## 1450   0.9942 
## 1500   0.9928 
## 1550   0.9917 
## 1600   0.9905 
## 1650   0.9896 
## 1700   0.989 
## 1750   0.988 
## 1800   0.9874 
## 1850   0.9868 
## 1900   0.9868 
## 1950   0.9865 
## 2000   0.9859 
## 2050   0.9855 
## 2100   0.9856 
## 2150   0.9859 
## 2200   0.9861 
## 2250   0.9863 
## 2300   0.9864 
## 2350   0.9869 
## 2400   0.9873 
## 2450   0.988 
## 2500   0.9885 
## 2550   0.9895

## 
## mean total deviance = 1.221 
## mean residual deviance = 0.768 
##  
## estimated cv deviance = 0.985 ; se = 0.043 
##  
## training data correlation = 0.718 
## cv correlation =  0.507 ; se = 0.046 
##  
## training data AUC score = 0.924 
## cv AUC score = 0.74 ; se = 0.035 
##  
## elapsed time -  0.45 minutes 
## [1] 2
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2274 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.2111 
## now adding trees... 
## 100   1.1969 
## 150   1.1844 
## 200   1.1736 
## 250   1.1639 
## 300   1.1556 
## 350   1.148 
## 400   1.1412 
## 450   1.135 
## 500   1.1297 
## 550   1.1246 
## 600   1.1196 
## 650   1.1154 
## 700   1.1119 
## 750   1.1086 
## 800   1.1058 
## 850   1.1033 
## 900   1.0999 
## 950   1.0978 
## 1000   1.0959 
## 1050   1.0938 
## 1100   1.0923 
## 1150   1.0904 
## 1200   1.0896 
## 1250   1.0887 
## 1300   1.0879 
## 1350   1.087 
## 1400   1.0865 
## 1450   1.086 
## 1500   1.0855 
## 1550   1.0849 
## 1600   1.0844 
## 1650   1.0839 
## 1700   1.0837 
## 1750   1.0833 
## 1800   1.0833 
## 1850   1.0832 
## 1900   1.0837 
## 1950   1.0834 
## 2000   1.0836 
## 2050   1.0838 
## 2100   1.0839 
## 2150   1.0842 
## 2200   1.0845 
## 2250   1.0851 
## 2300   1.0855

## 
## mean total deviance = 1.227 
## mean residual deviance = 0.858 
##  
## estimated cv deviance = 1.083 ; se = 0.023 
##  
## training data correlation = 0.671 
## cv correlation =  0.398 ; se = 0.033 
##  
## training data AUC score = 0.901 
## cv AUC score = 0.724 ; se = 0.024 
##  
## elapsed time -  0.44 minutes 
## [1] 3
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2335 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.2126 
## now adding trees... 
## 100   1.1943 
## 150   1.1781 
## 200   1.163 
## 250   1.1494 
## 300   1.1365 
## 350   1.1251 
## 400   1.1144 
## 450   1.1047 
## 500   1.0952 
## 550   1.0865 
## 600   1.0783 
## 650   1.0712 
## 700   1.0649 
## 750   1.0581 
## 800   1.0528 
## 850   1.0479 
## 900   1.0436 
## 950   1.0396 
## 1000   1.0357 
## 1050   1.0324 
## 1100   1.029 
## 1150   1.0259 
## 1200   1.0234 
## 1250   1.0209 
## 1300   1.0186 
## 1350   1.0171 
## 1400   1.0154 
## 1450   1.0137 
## 1500   1.012 
## 1550   1.0106 
## 1600   1.0095 
## 1650   1.0087 
## 1700   1.0081 
## 1750   1.0079 
## 1800   1.0067 
## 1850   1.0064 
## 1900   1.0065 
## 1950   1.0062 
## 2000   1.0056 
## 2050   1.0054 
## 2100   1.0053 
## 2150   1.0055 
## 2200   1.0056 
## 2250   1.0061 
## 2300   1.0061 
## 2350   1.0067 
## 2400   1.007 
## 2450   1.0079 
## 2500   1.0085

## 
## mean total deviance = 1.234 
## mean residual deviance = 0.758 
##  
## estimated cv deviance = 1.005 ; se = 0.047 
##  
## training data correlation = 0.731 
## cv correlation =  0.475 ; se = 0.05 
##  
## training data AUC score = 0.93 
## cv AUC score = 0.778 ; se = 0.03 
##  
## elapsed time -  0.48 minutes 
## [1] 4
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.1946 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.1734 
## now adding trees... 
## 100   1.1551 
## 150   1.1388 
## 200   1.1237 
## 250   1.1107 
## 300   1.0991 
## 350   1.0883 
## 400   1.0787 
## 450   1.0696 
## 500   1.0616 
## 550   1.054 
## 600   1.0468 
## 650   1.0401 
## 700   1.0341 
## 750   1.0283 
## 800   1.0228 
## 850   1.0176 
## 900   1.0126 
## 950   1.0079 
## 1000   1.0036 
## 1050   1 
## 1100   0.9965 
## 1150   0.9931 
## 1200   0.9898 
## 1250   0.9868 
## 1300   0.9836 
## 1350   0.9807 
## 1400   0.9783 
## 1450   0.9761 
## 1500   0.9742 
## 1550   0.9721 
## 1600   0.9706 
## 1650   0.9688 
## 1700   0.967 
## 1750   0.9658 
## 1800   0.9647 
## 1850   0.9635 
## 1900   0.9625 
## 1950   0.9613 
## 2000   0.9602 
## 2050   0.9594 
## 2100   0.9585 
## 2150   0.9578 
## 2200   0.957 
## 2250   0.9565 
## 2300   0.9565 
## 2350   0.9563 
## 2400   0.9559 
## 2450   0.9552 
## 2500   0.9549 
## 2550   0.9546 
## 2600   0.9548 
## 2650   0.9548 
## 2700   0.9554 
## 2750   0.9553 
## 2800   0.9554 
## 2850   0.9553 
## 2900   0.9554 
## 2950   0.9554 
## 3000   0.9558

## 
## mean total deviance = 1.195 
## mean residual deviance = 0.667 
##  
## estimated cv deviance = 0.955 ; se = 0.033 
##  
## training data correlation = 0.765 
## cv correlation =  0.509 ; se = 0.033 
##  
## training data AUC score = 0.941 
## cv AUC score = 0.798 ; se = 0.021 
##  
## elapsed time -  0.59 minutes 
## 
##  ########### warning ########## 
##  
## maximum tree limit reached - results may not be optimal 
##   - refit with faster learning rate or increase maximum number of trees 
## [1] 5
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2335 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.2134 
## now adding trees... 
## 100   1.1956 
## 150   1.1804 
## 200   1.1666 
## 250   1.1544 
## 300   1.1438 
## 350   1.1345 
## 400   1.1259 
## 450   1.1184 
## 500   1.1118 
## 550   1.1058 
## 600   1.1006 
## 650   1.0958 
## 700   1.0913 
## 750   1.0871 
## 800   1.0832 
## 850   1.08 
## 900   1.0775 
## 950   1.0747 
## 1000   1.0721 
## 1050   1.0701 
## 1100   1.0681 
## 1150   1.0666 
## 1200   1.0651 
## 1250   1.0634 
## 1300   1.0622 
## 1350   1.0614 
## 1400   1.0601 
## 1450   1.0594 
## 1500   1.0592 
## 1550   1.0583 
## 1600   1.0573 
## 1650   1.0571 
## 1700   1.057 
## 1750   1.0564 
## 1800   1.0562 
## 1850   1.0563 
## 1900   1.0564 
## 1950   1.0566 
## 2000   1.0571 
## 2050   1.0573 
## 2100   1.0574 
## 2150   1.0576 
## 2200   1.0585 
## 2250   1.0587 
## 2300   1.0593

## 
## mean total deviance = 1.234 
## mean residual deviance = 0.833 
##  
## estimated cv deviance = 1.056 ; se = 0.037 
##  
## training data correlation = 0.685 
## cv correlation =  0.44 ; se = 0.046 
##  
## training data AUC score = 0.913 
## cv AUC score = 0.717 ; se = 0.034 
##  
## elapsed time -  0.44 minutes 
## [1] 6
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2211 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.2042 
## now adding trees... 
## 100   1.189 
## 150   1.1753 
## 200   1.1624 
## 250   1.1508 
## 300   1.1399 
## 350   1.1301 
## 400   1.1209 
## 450   1.1123 
## 500   1.1043 
## 550   1.0969 
## 600   1.09 
## 650   1.0834 
## 700   1.0771 
## 750   1.0718 
## 800   1.0668 
## 850   1.062 
## 900   1.0574 
## 950   1.0535 
## 1000   1.05 
## 1050   1.0466 
## 1100   1.0433 
## 1150   1.0401 
## 1200   1.0372 
## 1250   1.0347 
## 1300   1.0326 
## 1350   1.0303 
## 1400   1.0281 
## 1450   1.0257 
## 1500   1.0245 
## 1550   1.0232 
## 1600   1.0219 
## 1650   1.0207 
## 1700   1.0196 
## 1750   1.0184 
## 1800   1.0172 
## 1850   1.0163 
## 1900   1.0159 
## 1950   1.0152 
## 2000   1.0145 
## 2050   1.0143 
## 2100   1.0138 
## 2150   1.0137 
## 2200   1.0136 
## 2250   1.0133 
## 2300   1.0134 
## 2350   1.0136 
## 2400   1.0137 
## 2450   1.014 
## 2500   1.0142 
## 2550   1.0143 
## 2600   1.0153 
## 2650   1.0157 
## 2700   1.0159

## 
## mean total deviance = 1.221 
## mean residual deviance = 0.749 
##  
## estimated cv deviance = 1.013 ; se = 0.048 
##  
## training data correlation = 0.732 
## cv correlation =  0.462 ; se = 0.055 
##  
## training data AUC score = 0.928 
## cv AUC score = 0.763 ; se = 0.026 
##  
## elapsed time -  0.52 minutes 
## [1] 7
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2335 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.217 
## now adding trees... 
## 100   1.2024 
## 150   1.1888 
## 200   1.1772 
## 250   1.1667 
## 300   1.1563 
## 350   1.1469 
## 400   1.1385 
## 450   1.1309 
## 500   1.1237 
## 550   1.1168 
## 600   1.111 
## 650   1.1058 
## 700   1.1005 
## 750   1.0957 
## 800   1.0911 
## 850   1.0866 
## 900   1.083 
## 950   1.0795 
## 1000   1.0761 
## 1050   1.0733 
## 1100   1.0705 
## 1150   1.0679 
## 1200   1.0656 
## 1250   1.0635 
## 1300   1.0609 
## 1350   1.0588 
## 1400   1.0567 
## 1450   1.0544 
## 1500   1.0529 
## 1550   1.0511 
## 1600   1.0503 
## 1650   1.0493 
## 1700   1.0479 
## 1750   1.0469 
## 1800   1.0467 
## 1850   1.0465 
## 1900   1.0459 
## 1950   1.0455 
## 2000   1.0447 
## 2050   1.0444 
## 2100   1.0443 
## 2150   1.0444 
## 2200   1.0441 
## 2250   1.044 
## 2300   1.0446 
## 2350   1.0443 
## 2400   1.0449 
## 2450   1.0447 
## 2500   1.0444 
## 2550   1.0447 
## 2600   1.045 
## 2650   1.0454

## 
## mean total deviance = 1.234 
## mean residual deviance = 0.779 
##  
## estimated cv deviance = 1.044 ; se = 0.053 
##  
## training data correlation = 0.716 
## cv correlation =  0.443 ; se = 0.062 
##  
## training data AUC score = 0.917 
## cv AUC score = 0.752 ; se = 0.039 
##  
## elapsed time -  0.51 minutes 
## [1] 8
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2396 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.2206 
## now adding trees... 
## 100   1.2045 
## 150   1.1903 
## 200   1.1772 
## 250   1.1655 
## 300   1.1555 
## 350   1.1466 
## 400   1.1388 
## 450   1.1323 
## 500   1.1262 
## 550   1.121 
## 600   1.1166 
## 650   1.1126 
## 700   1.1088 
## 750   1.1049 
## 800   1.1017 
## 850   1.0986 
## 900   1.0958 
## 950   1.0936 
## 1000   1.0913 
## 1050   1.0892 
## 1100   1.0872 
## 1150   1.0858 
## 1200   1.0842 
## 1250   1.0826 
## 1300   1.0817 
## 1350   1.0805 
## 1400   1.08 
## 1450   1.0795 
## 1500   1.0788 
## 1550   1.0781 
## 1600   1.0779 
## 1650   1.0776 
## 1700   1.0771 
## 1750   1.0771 
## 1800   1.077 
## 1850   1.0774 
## 1900   1.0776 
## 1950   1.0775 
## 2000   1.0775 
## 2050   1.0777 
## 2100   1.0781 
## 2150   1.0782 
## 2200   1.0781 
## 2250   1.0787

## 
## mean total deviance = 1.24 
## mean residual deviance = 0.853 
##  
## estimated cv deviance = 1.077 ; se = 0.059 
##  
## training data correlation = 0.674 
## cv correlation =  0.403 ; se = 0.083 
##  
## training data AUC score = 0.896 
## cv AUC score = 0.725 ; se = 0.047 
##  
## elapsed time -  0.43 minutes 
## [1] 9
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2455 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.2281 
## now adding trees... 
## 100   1.2122 
## 150   1.1983 
## 200   1.1866 
## 250   1.1755 
## 300   1.1659 
## 350   1.1571 
## 400   1.1492 
## 450   1.1424 
## 500   1.1355 
## 550   1.1296 
## 600   1.1242 
## 650   1.1194 
## 700   1.1146 
## 750   1.1105 
## 800   1.107 
## 850   1.1033 
## 900   1.0996 
## 950   1.0972 
## 1000   1.0944 
## 1050   1.0921 
## 1100   1.0897 
## 1150   1.088 
## 1200   1.0865 
## 1250   1.085 
## 1300   1.0841 
## 1350   1.0832 
## 1400   1.0821 
## 1450   1.0813 
## 1500   1.0808 
## 1550   1.0798 
## 1600   1.0795 
## 1650   1.0791 
## 1700   1.0789 
## 1750   1.0788 
## 1800   1.079 
## 1850   1.0794 
## 1900   1.0796 
## 1950   1.0799 
## 2000   1.0802 
## 2050   1.081 
## 2100   1.0816 
## 2150   1.0822 
## 2200   1.0824

## 
## mean total deviance = 1.245 
## mean residual deviance = 0.852 
##  
## estimated cv deviance = 1.079 ; se = 0.052 
##  
## training data correlation = 0.682 
## cv correlation =  0.416 ; se = 0.072 
##  
## training data AUC score = 0.897 
## cv AUC score = 0.714 ; se = 0.04 
##  
## elapsed time -  0.42 minutes 
## [1] 10
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for NA and using a family of bernoulli 
## Using 267 observations and 17 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2274 
## tolerance is fixed at  0.0012 
## ntrees resid. dev. 
## 50    1.2034 
## now adding trees... 
## 100   1.1822 
## 150   1.1637 
## 200   1.1478 
## 250   1.1328 
## 300   1.1192 
## 350   1.1068 
## 400   1.0954 
## 450   1.0852 
## 500   1.0759 
## 550   1.0678 
## 600   1.0598 
## 650   1.0524 
## 700   1.0459 
## 750   1.0404 
## 800   1.0358 
## 850   1.0311 
## 900   1.0271 
## 950   1.0237 
## 1000   1.0207 
## 1050   1.0174 
## 1100   1.015 
## 1150   1.0126 
## 1200   1.0103 
## 1250   1.0083 
## 1300   1.0063 
## 1350   1.0048 
## 1400   1.0033 
## 1450   1.0023 
## 1500   1.0011 
## 1550   0.9993 
## 1600   0.9988 
## 1650   0.9981 
## 1700   0.9976 
## 1750   0.9973 
## 1800   0.9966 
## 1850   0.9965 
## 1900   0.9962 
## 1950   0.9958 
## 2000   0.9957 
## 2050   0.9959 
## 2100   0.9959 
## 2150   0.9962 
## 2200   0.9966 
## 2250   0.997 
## 2300   0.9971 
## 2350   0.997 
## 2400   0.9974 
## 2450   0.9981

## 
## mean total deviance = 1.227 
## mean residual deviance = 0.762 
##  
## estimated cv deviance = 0.996 ; se = 0.065 
##  
## training data correlation = 0.719 
## cv correlation =  0.486 ; se = 0.076 
##  
## training data AUC score = 0.939 
## cv AUC score = 0.768 ; se = 0.039 
##  
## elapsed time -  0.39 minutes
# AUC values through the various iterations.
compared_model_halc
##    halc_auc_glm halc_auc_gam halc_auc_gbm1 halc_auc_gbm2
## 1     0.7755991    0.7705156     0.7418301     0.7447349
## 2     0.8534738    0.8534738     0.8100517     0.8215078
## 3     0.7142319    0.7142319     0.7164910     0.7198795
## 4     0.7127478    0.7127478     0.6866029     0.6746411
## 5     0.7863328    0.7855798     0.7865211     0.8045934
## 6     0.7516340    0.7523602     0.7734205     0.7799564
## 7     0.7432229    0.7443524     0.7951807     0.7725904
## 8     0.7991551    0.7953149     0.8206605     0.8294931
## 9     0.7668627    0.7625490     0.7917647     0.7854902
## 10    0.7390983    0.7431633     0.6906874     0.7250554
#The mean AUC values for the various models
mean_auc_halc<-colMeans(compared_model_halc)
#  attach(compared_model_halc)
#This was used temporarily to see if other models are significantly better.

# wilcox.test(halc_auc_gbm1, halc_auc_gam, paired = TRUE)

#let's see the mean AUC values for all the models
mean_auc_halc
##  halc_auc_glm  halc_auc_gam halc_auc_gbm1 halc_auc_gbm2 
##     0.7642358     0.7634289     0.7613211     0.7657942
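The commented-out comparison above can be run as a paired test, since each row of `compared_model_halc` comes from the same train/test iteration. A sketch, assuming the `compared_model_halc` data frame printed above:

```r
# Paired Wilcoxon signed-rank test: does GBM2 give systematically different
# per-iteration AUCs than the GLM? (paired because the iterations match up)
wilcox.test(compared_model_halc$halc_auc_gbm2,
            compared_model_halc$halc_auc_glm,
            paired = TRUE)
```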

ROC (AUC) curves of the various models.

#show the AUC curves from the last iteration on the test data
par(mfrow=c(2,2))
colAUC(halc_glm_pred, eva$high_use , plotROC = T)
##                     [,1]
## FALSE vs. TRUE 0.7390983
colAUC(pred_halc_gam, eva$high_use , plotROC = T)
##                     [,1]
## FALSE vs. TRUE 0.7431633
colAUC(pred_halc_gbm1, eva$high_use , plotROC = T)
##                     [,1]
## FALSE vs. TRUE 0.6906874
colAUC(pred_halc_gbm2, eva$high_use , plotROC = T)

##                     [,1]
## FALSE vs. TRUE 0.7250554
par(mfrow=c(1,1))

Here, I utilised the Area Under the Curve (AUC) to evaluate my models, because it avoids the subjectivity of selecting a threshold, which can significantly affect the predictive performance and the commission and omission errors. Although measures such as prevalence have been recommended for choosing the threshold used to convert probabilities into binary values (e.g. TRUE or FALSE), AUC takes care of this by considering all possible thresholds and evaluating the performance accordingly. A value above 0.9 indicates an excellent model, while AUC values in the ranges 0.7-0.8 and 0.5-0.7, and a value of 0.5, are fairly good, poor and very poor respectively. Here, my AUC values through the boosting and resampling iterations are about 0.7 for the three different model types (GLM, GAM and GBM) that I applied.
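The threshold-free view of AUC described above can be made concrete with a small sketch: AUC equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one, with ties counting as one half. The names `auc_manual`, `probs` and `obs` are hypothetical, not part of the analysis above.

```r
# Illustrative sketch of AUC without any fixed threshold.
# probs: predicted probabilities; obs: logical vector of observed outcomes.
auc_manual <- function(probs, obs) {
  pos <- probs[obs]    # predictions for the positive cases
  neg <- probs[!obs]   # predictions for the negative cases
  # pairwise comparisons of every positive against every negative
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
```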

Model for alcohol use

RESPONSE CURVES AND MODEL INTERPRETATIONS.

####################################################
#Model for alcohol use
#RESPONSE CURVES AND MODEL INTERPRETATIONS.
#Here, I decided to use the entire dataset for the analysis
#GAM
halc_gam<-gam(high_use~ sex+ s(studytime, k=3) + s(goout, k=3) + 
                s(absences, k=3) , data =data_reg, family = "binomial")
summary(halc_gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## high_use ~ sex + s(studytime, k = 3) + s(goout, k = 3) + s(absences, 
##     k = 3)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.4490     0.1983  -7.308 2.72e-13 ***
## sexM          0.7817     0.2656   2.944  0.00324 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df Chi.sq  p-value    
## s(studytime)   1      1  6.095   0.0136 *  
## s(goout)       1      1 36.715 1.37e-09 ***
## s(absences)    1      1 12.118   0.0005 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.227   Deviance explained = 18.1%
## UBRE = 0.024361  Scale est. = 1         n = 382
#plot the response curves from GAM
plot(halc_gam, pages=1)

From the above, we can see the response of high alcohol use to the various predictors, together with the confidence intervals. The tendency towards high alcohol use appears to decrease as study time increases. This is expected, as such students have less time for social activities and parties. Going out, on the other hand, expectedly increases the tendency towards high alcohol use. In a similar vein, absence from school increases the tendency too, but as can be seen from the plot, the confidence interval widens as the number of absences grows, which might indicate insufficient data in that region.

To get deeper insight, I will explore this further with GBM.

#GBM
halc_gbm1 <- gbm(formula = high_use ~ sex + age + address + Medu + Fedu +
                   Pstatus + traveltime + studytime + famsup + activities +
                   higher + internet + famrel + romantic + freetime + goout +
                   absences,
                 data = data_reg, distribution = "bernoulli", n.trees = 2800,
                 shrinkage = 0.001, interaction.depth = 6, bag.fraction = 0.75)
summary(halc_gbm1)

##                   var    rel.inf
## goout           goout 29.7986228
## absences     absences 18.6109317
## sex               sex 13.6762991
## famrel         famrel  7.6715637
## traveltime traveltime  6.7131311
## Medu             Medu  4.0753433
## studytime   studytime  3.6682986
## freetime     freetime  3.3699362
## age               age  2.6712668
## activities activities  2.5153293
## Fedu             Fedu  2.2009417
## address       address  1.9756887
## famsup         famsup  1.4060125
## romantic     romantic  1.0463254
## internet     internet  0.3687841
## Pstatus       Pstatus  0.1251636
## higher         higher  0.1063612

From the relative importance, we can see that goout, absences, sex and family relationship (famrel) appear to have the strongest effect on high alcohol use. The least important factors can also be seen in the model summary.

Now, I’ll show the response curves from GBM to see the effects.

#Now, I'll show the response curves from GBM to see the effects

#plot(halc_gbm1)
best.iter1<-gbm.perf(halc_gbm1, plot.it = F, method = "OOB")

# pred_halc_gbm1<-predict.gbm(halc_gbm1,newdata = eva, best.iter1, type = "response")

par(mfrow=c(2,2))
plot.gbm(halc_gbm1, "goout", best.iter1)
plot.gbm(halc_gbm1, "absences", best.iter1)
plot.gbm(halc_gbm1, "sex", best.iter1)
plot.gbm(halc_gbm1, "famrel", best.iter1)

par(mfrow=c(1,1))
plot.gbm(halc_gbm1, "studytime", best.iter1)

As can also be seen from the GBM response curves, absences and going out increase the tendency towards high alcohol use. Going out has a steep effect once it rises above average, and slightly more than 10 absences is already enough to increase the tendency. Good family relationships, however, reduce it, and overall, male students appear to be more prone to high alcohol use than their female counterparts. Also, as shown earlier, more study time appears to reduce the tendency towards high alcohol use.

ODDS RATIO

#explore the odds ratios
halc_glm_mod <- glm(high_use ~ sex + studytime + goout + absences, data = data_reg, family = "binomial")
odds_ra <- exp(coef(halc_glm_mod))
#odds_ra <- coef(high_use_mod2) %>% exp     #alternative

# compute confidence intervals (conf_int)
conf_int <- exp(confint(halc_glm_mod)) 
#Conf_Int <- high_use_mod2 %>%  confint() %>% exp   #alternative

# print out the odds ratios with their confidence intervals
cbind(odds_ra, conf_int)
##                odds_ra      2.5 %    97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## sexM        2.18525582 1.30370562 3.7010763
## studytime   0.65628388 0.46510383 0.9099505
## goout       2.06838526 1.64470314 2.6349716
## absences    1.08123884 1.03577383 1.1324143

Here, we can see that the odds ratios for male students, absences and going out are all greater than 1, indicating that each of these increases the likelihood of high alcohol consumption by students. This is not surprising: absence from school and going out would normally be expected to accompany high alcohol use. The confidence intervals, which exclude 1 for all of these factors, suggest the effects are reliable. Study time, on the other hand, has an odds ratio below 1, which is also not surprising, because students who study more would be expected to have less time for going out and getting drunk. These results all agree with the models used earlier.

CONFUSION MATRIX

ERROR IN PREDICTED HIGH ALCOHOL USE. Although AUC is preferred, as it considers all possible thresholds and avoids subjectivity, I will also explore a confusion matrix, with 0.5 as the threshold for TRUE or FALSE high alcohol use. More information on creating functions to calculate these metrics in a more automated way can be found here

#fit the model
# predict() the probability of high_use
probs<- predict(halc_glm_mod, type = "response")

#Copy the data again
alc<-data_reg

#add the predicted probabilities to 'alc'
alc$prob_high_use <- probs
#alc <- mutate(alc, probability = probabilities)

#use the probabilities to make a prediction of high_use, setting 0.5 as threshold
alc$predict_high_use<- (alc$prob_high_use)>0.5
#alc <- mutate(alc, prediction = prob_high_use>0.5)

#see the first ten and last ten original classes, predicted probabilities, and class predictions
head(alc[,c("failures", "absences", "sex", "high_use", "prob_high_use", "predict_high_use")], n=10)
##    failures absences sex high_use prob_high_use predict_high_use
## 1         0        3   F    FALSE    0.03910480            FALSE
## 2         1        2   F     TRUE    0.27212509            FALSE
## 3         0        8   F    FALSE    0.09326837            FALSE
## 4         0        2   F    FALSE    0.05424023            FALSE
## 5         1        5   F     TRUE    0.03386206            FALSE
## 6         0        2   F    FALSE    0.05424023            FALSE
## 7         1        0   F    FALSE    0.04676258            FALSE
## 8         0        1   F    FALSE    0.14322690            FALSE
## 9         0        9   F     TRUE    0.06105537            FALSE
## 10        0       10   F    FALSE    0.12696107            FALSE
#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$predict_high_use)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     65   49
#Here, I derived the various classification matrix metrics
#Accuracy: Overall, how often is the classifier correct?
#(TP+TN)/total
accuracy<- (252 + 49)/nrow(alc)
print(paste("The accuracy of the classification is", accuracy))
## [1] "The accuracy of the classification is 0.787958115183246"
#Error rate
print(paste("The error rate/misclassification is", 1-accuracy))
## [1] "The error rate/misclassification is 0.212041884816754"
#True Positive Rate:
#"Sensitivity" or "Recall": When actually yes, how often does it predict yes?
#(TP/actual yes)
print(paste("The True positive rate/Sensitivity is", 49/(65+49)))
## [1] "The True positive rate/Sensitivity is 0.429824561403509"
#False Positive Rate: i.e When actually no, how often does it predict yes?
print(paste("The false positive rate is", 16/(16+252)))
## [1] "The false positive rate is 0.0597014925373134"
#Specificity: When actually no, how often does it predict no?
#TN/actual no
#Alternatively: 1 minus False Positive Rate
print(paste("The Specificity is", 252/(252+16)))
## [1] "The Specificity is 0.940298507462687"
#Precision: When it predicts yes, how often is it correct?
#TP/predicted yes
print(paste("The precision is", 49/(16+49)))
## [1] "The precision is 0.753846153846154"
#Prevalence: How often does the yes condition actually occur in our sample?
#actual yes/total
print(paste("The prevalence is", (65+49)/nrow(alc)))
## [1] "The prevalence is 0.298429319371728"
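The metrics above are computed from hard-coded counts; as a sketch, the same figures can be derived directly from the confusion matrix, so they stay correct if the model changes. `conf_metrics` is a hypothetical helper assuming the 2x2 table layout printed above (rows = observed `high_use`, columns = prediction).

```r
# Derive classification metrics from a 2x2 confusion matrix
# with "FALSE"/"TRUE" row and column names.
conf_metrics <- function(cm) {
  tn <- cm["FALSE", "FALSE"]; fp <- cm["FALSE", "TRUE"]
  fn <- cm["TRUE",  "FALSE"]; tp <- cm["TRUE",  "TRUE"]
  c(accuracy    = (tp + tn) / sum(cm),   # overall proportion correct
    error_rate  = (fp + fn) / sum(cm),   # 1 - accuracy
    sensitivity = tp / (tp + fn),        # true positive rate
    specificity = tn / (tn + fp),        # true negative rate
    precision   = tp / (tp + fp),        # positive predictive value
    prevalence  = (tp + fn) / sum(cm))   # proportion of actual positives
}

# conf_metrics(table(high_use = alc$high_use, prediction = alc$predict_high_use))
```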

The accuracy of the prediction is about 0.79, which is fairly good, and the misclassification/error rate is about 0.21. Using a threshold of 0.5, 252 cases of no high alcohol use were correctly predicted and 16 falsely predicted, while 49 cases of high alcohol use were correctly predicted and 65 wrongly predicted. The model is better at predicting the absence of high alcohol use (specificity) than its presence (sensitivity). This can also be seen in the plot below.

Plotting the error of prediction.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = prob_high_use, y = high_use, col= predict_high_use))

# define the geom as points and draw the plot
g + geom_point()

ALTERNATIVE FOR CALCULATING THE ERROR

#Proportion of errors
conf_mat<-table(high_use = alc$high_use, prediction = alc$predict_high_use)
conf_mat<-prop.table(conf_mat)
addmargins(conf_mat)
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65968586 0.04188482 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.82984293 0.17015707 1.00000000
#mean error of the prediction
mean(abs(alc$high_use-alc$prob_high_use)>0.5)
## [1] 0.2120419
#An alternative way: first define a function that calculates the mean prediction error.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob_high_use)
## [1] 0.2120419

K-fold cross-validation

#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = halc_glm_mod, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241

Here I used an arbitrary threshold of 0.5, but it has been suggested that prevalence can be used to select the threshold. You can see more here about terminology related to the confusion/classification matrix.

LINEAR DISCRIMINANT ANALYSIS (LDA).

###############################################################
#LINEAR DISCRIMINANT ANALYSIS.


#copy data again
data_copy2<- alco_data

#filter the numeric variables.
data_num<-Filter(is.numeric, data_copy2)

#Scale the numeric variables
data_num_sca<- data.frame( scale(data_num))

#get the categorised grades created earlier into the vector
G3_cat<-data_mca_fac2[,c("G3")]

#add to the scaled data frame
data_num_G3<- cbind(data_num_sca, G3_cat)

#summary
summary(data_num_G3)
##       age               Medu              Fedu           traveltime     
##  Min.   :-1.3519   Min.   :-2.5831   Min.   :-2.3402   Min.   :-0.6434  
##  1st Qu.:-0.4997   1st Qu.:-0.7422   1st Qu.:-0.5158   1st Qu.:-0.6434  
##  Median : 0.3525   Median : 0.1783   Median : 0.3964   Median :-0.6434  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3525   3rd Qu.: 1.0988   3rd Qu.: 1.3086   3rd Qu.: 0.7939  
##  Max.   : 4.6133   Max.   : 1.0988   Max.   : 1.3086   Max.   : 3.6683  
##    studytime           failures           famrel            freetime      
##  Min.   :-1.23721   Min.   :-0.3458   Min.   :-3.24319   Min.   :-2.2541  
##  1st Qu.:-1.23721   1st Qu.:-0.3458   1st Qu.: 0.06937   1st Qu.:-0.2233  
##  Median :-0.04374   Median :-0.3458   Median : 0.06937   Median :-0.2233  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.:-0.04374   3rd Qu.:-0.3458   3rd Qu.: 1.17356   3rd Qu.: 0.7921  
##  Max.   : 2.34320   Max.   : 4.8004   Max.   : 1.17356   Max.   : 1.8075  
##      goout              Dalc              Walc             health       
##  Min.   :-1.8897   Min.   :-0.5434   Min.   :-1.0162   Min.   :-1.8397  
##  1st Qu.:-0.9952   1st Qu.:-0.5434   1st Qu.:-1.0162   1st Qu.:-0.4099  
##  Median :-0.1007   Median :-0.5434   Median :-0.2320   Median : 0.3051  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7938   3rd Qu.: 0.5847   3rd Qu.: 0.5522   3rd Qu.: 1.0200  
##  Max.   : 1.6883   Max.   : 3.9691   Max.   : 2.1206   Max.   : 1.0200  
##     absences             G1                G2                G3         
##  Min.   :-0.8238   Min.   :-3.5147   Min.   :-2.6409   Min.   :-3.4614  
##  1st Qu.:-0.6407   1st Qu.:-0.5517   1st Qu.:-0.5185   1st Qu.:-0.4405  
##  Median :-0.2746   Median : 0.1891   Median : 0.1889   Median : 0.1637  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2746   3rd Qu.: 0.9298   3rd Qu.: 0.8963   3rd Qu.: 0.7679  
##  Max.   : 7.4139   Max.   : 2.4113   Max.   : 2.3112   Max.   : 1.9763  
##     alc_use          G3_cat   
##  Min.   :-0.9024   vlow :143  
##  1st Qu.:-0.9024   low  :107  
##  Median :-0.3947   high : 64  
##  Mean   : 0.0000   vhigh: 68  
##  3rd Qu.: 0.6207              
##  Max.   : 3.1592
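As a sanity check on the standardisation above (a toy vector, not the actual data): `scale()` subtracts the column mean and divides by the standard deviation, which is why every numeric column in the summary has mean 0.

```r
# Toy check of what scale() does: centre on the mean, divide by the sd.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
colMeans(scale(x))                       # essentially 0 after scaling
```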

DIVIDE THE DATA INTO TRAIN AND TEST SETS FOR VALIDATION.

#now, divide into 80:20 train:test data
#(no seed is set here, so the exact split varies between runs)
samp <- sample(1:nrow(data_num_G3), size = nrow(data_num_G3)*0.8)
train <- data_num_G3[samp,]
test <- data_num_G3[-samp,]


#Dimension.
dim(train)
## [1] 305  18
#remove the grades columns
train$G1<-train$G2<-train$G3<-NULL

#check the column names now
colnames(train)
##  [1] "age"        "Medu"       "Fedu"       "traveltime" "studytime" 
##  [6] "failures"   "famrel"     "freetime"   "goout"      "Dalc"      
## [11] "Walc"       "health"     "absences"   "alc_use"    "G3_cat"

Fit the categorised grade against all the other variables:

#fit the categorised grade to all other variables
lda.fit <- lda(G3_cat~., data = train)
#lda.fit

#create arrow function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

#convert the categorised grade into numeric for the LDA plot
train$G3_cat <- as.numeric(train$G3_cat)

#plot (G1-G3 were removed from train, so colour and symbol by the categorised grade)
plot(lda.fit, dimen = 2, col = train$G3_cat, pch = train$G3_cat)
lda.arrows(lda.fit, myscale = 2)

Past failures contribute most to poor grades, and absences also appear to play a role. The mother's education level seems to be a factor in students' success.

LDA CLASSIFICATION ACCURACY

#see the class prediction accuracy
G3_cat<-test$G3_cat
test<-dplyr::select(test, -G3_cat)
lda.pred<-predict(lda.fit, newdata = test)
table(correct = G3_cat, predicted = lda.pred$class)
##        predicted
## correct vlow low high vhigh
##   vlow    16   9    0     2
##   low     14   9    0     2
##   high     7   4    4     1
##   vhigh    5   1    0     3

The table above shows the correct classifications (on the diagonal) and the misclassifications (off the diagonal).
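The overall accuracy can be read off such a matrix as the share of counts on the diagonal (the counts below are re-entered by hand from the table above):

```r
# Confusion matrix counts re-entered from the printed table.
cm <- matrix(c(16, 9, 0, 2,
               14, 9, 0, 2,
                7, 4, 4, 1,
                5, 1, 0, 3),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("vlow", "low", "high", "vhigh"),
                             predicted = c("vlow", "low", "high", "vhigh")))

# accuracy = correct classifications / all test observations
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)  # about 0.416
```

Note that the split is random, so these counts (and the accuracy) will differ between runs.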

Distance between observations.

#Now, see the distances between observations.
dist_sca<-dist(data_num_sca)
summary(dist_sca)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5492  4.4994  5.4263  5.6181  6.5457 13.4290
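`dist()` returns the pairwise Euclidean distances between the rows (observations) of the data, which is what the summary above describes. A toy two-row example:

```r
# Two observations in 2D: (0, 0) and (3, 4).
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
as.numeric(dist(m))  # 5, i.e. sqrt(3^2 + 4^2)
```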

Next, cluster the data using k-means clustering and refit the LDA on the clustered data.

First, determine the optimal number of clusters using the elbow method on the total within-cluster sum of squares (TWCSS).


set.seed(123) #This is used to make the values constant when re-run
# determine the number of clusters
k_max <- 20

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(data_num_sca, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = c('point', 'line'))
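The "elbow" is usually read off the plot by eye, but it can also be picked programmatically as the point after which the relative drop in TWCSS becomes small (the `twcss_demo` values below are made up for illustration, not the actual curve):

```r
# Made-up TWCSS values for k = 1..8, for illustration only.
twcss_demo <- c(500, 300, 180, 150, 140, 135, 132, 130)

# Relative drop in TWCSS at each step from k to k + 1.
drops <- -diff(twcss_demo) / twcss_demo[-length(twcss_demo)]

# First k after which the drop falls below 20%: going beyond it gains little.
elbow <- which(drops < 0.2)[1]
elbow  # 3
```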

############################################################
#some other methods of exploring optimal number of classes
# library(cluster)
# library(factoextra)
# # reVisualize k-means clusters
# km.res <- kmeans(data_num_sca, 3, nstart = 25)
# fviz_cluster(km.res, data = data_num_sca, geom = "point",
#              stand = FALSE, frame.type = "norm")
# 
# #install.packages("factoextra")
# require(cluster)
# fviz_nbclust(data_num_sca, kmeans, method = "silhouette")
#############################################################

Further, use NbClust to confirm. There are quite a few other methods as well: two of them are in the previous code chunk, where the packages "cluster" and "factoextra" can be used. See more here for other methods of checking the optimal number of clusters.

#install.packages("NbClust")
library(NbClust)
nb <- NbClust(data_num_sca, diss=NULL, distance = "euclidean", 
              min.nc=2, max.nc=5, method = "kmeans", 
              index = "all", alphaBeale = 0.1)

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 13 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
#Based on these, I will set the number of clusters to 3

# k-means clustering
#use the euclidean distance matrix calculated earlier
#(kmeans is more commonly run on the scaled data itself; clustering
# the rows of the distance matrix is a deliberate choice here)
km <-kmeans(dist_sca, centers = 3)

#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=data_num_sca)


plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)

Past failures and alcohol use also seem to be associated with poor performance.

#copy data again
data_copy2<-alco_data
data_num2<-Filter(is.numeric, data_copy2)
data_LDA1_sca<- data.frame( scale(data_num2))
summary(data_LDA1_sca)
##       age               Medu              Fedu           traveltime     
##  Min.   :-1.3519   Min.   :-2.5831   Min.   :-2.3402   Min.   :-0.6434  
##  1st Qu.:-0.4997   1st Qu.:-0.7422   1st Qu.:-0.5158   1st Qu.:-0.6434  
##  Median : 0.3525   Median : 0.1783   Median : 0.3964   Median :-0.6434  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3525   3rd Qu.: 1.0988   3rd Qu.: 1.3086   3rd Qu.: 0.7939  
##  Max.   : 4.6133   Max.   : 1.0988   Max.   : 1.3086   Max.   : 3.6683  
##    studytime           failures           famrel            freetime      
##  Min.   :-1.23721   Min.   :-0.3458   Min.   :-3.24319   Min.   :-2.2541  
##  1st Qu.:-1.23721   1st Qu.:-0.3458   1st Qu.: 0.06937   1st Qu.:-0.2233  
##  Median :-0.04374   Median :-0.3458   Median : 0.06937   Median :-0.2233  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.:-0.04374   3rd Qu.:-0.3458   3rd Qu.: 1.17356   3rd Qu.: 0.7921  
##  Max.   : 2.34320   Max.   : 4.8004   Max.   : 1.17356   Max.   : 1.8075  
##      goout              Dalc              Walc             health       
##  Min.   :-1.8897   Min.   :-0.5434   Min.   :-1.0162   Min.   :-1.8397  
##  1st Qu.:-0.9952   1st Qu.:-0.5434   1st Qu.:-1.0162   1st Qu.:-0.4099  
##  Median :-0.1007   Median :-0.5434   Median :-0.2320   Median : 0.3051  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7938   3rd Qu.: 0.5847   3rd Qu.: 0.5522   3rd Qu.: 1.0200  
##  Max.   : 1.6883   Max.   : 3.9691   Max.   : 2.1206   Max.   : 1.0200  
##     absences             G1                G2                G3         
##  Min.   :-0.8238   Min.   :-3.5147   Min.   :-2.6409   Min.   :-3.4614  
##  1st Qu.:-0.6407   1st Qu.:-0.5517   1st Qu.:-0.5185   1st Qu.:-0.4405  
##  Median :-0.2746   Median : 0.1891   Median : 0.1889   Median : 0.1637  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2746   3rd Qu.: 0.9298   3rd Qu.: 0.8963   3rd Qu.: 0.7679  
##  Max.   : 7.4139   Max.   : 2.4113   Max.   : 2.3112   Max.   : 1.9763  
##     alc_use       
##  Min.   :-0.9024  
##  1st Qu.:-0.9024  
##  Median :-0.3947  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6207  
##  Max.   : 3.1592
#remove the alcohol columns
data_LDA1_sca$alc_use<-data_LDA1_sca$Walc<-data_LDA1_sca$Dalc<-NULL

#see column names
colnames(data_LDA1_sca)
##  [1] "age"        "Medu"       "Fedu"       "traveltime" "studytime" 
##  [6] "failures"   "famrel"     "freetime"   "goout"      "health"    
## [11] "absences"   "G1"         "G2"         "G3"
#Get the alcohol use into a vector
alc_use<-alco_data[,c("alc_use")]

#combine with the scaled data
data_LDA2<- cbind(data_LDA1_sca, alc_use)


#summary(data_LDA2)

lda.fit2 <- lda(alc_use~., data = data_LDA2)
#lda.fit


#first, determine the optimal number of clusters using twcss.
set.seed(123)
# determine the number of clusters
k_max <- 10
alc_sca<-as.data.frame(scale(data_LDA2))
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(alc_sca, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering on the new distance matrix
dist_alc<- dist(alc_sca)
km <-kmeans(dist_alc, centers = 4)


#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=alc_sca)
lda.fit_clus
## Call:
## lda(km$cluster ~ ., data = alc_sca)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.24345550 0.23036649 0.09162304 0.43455497 
## 
## Group means:
##           age        Medu        Fedu traveltime   studytime   failures
## 1 -0.15150389  0.47524588  0.31793533 -0.2879191  0.34125036 -0.3273272
## 2  0.09100733 -0.26100930 -0.13226160  0.2385657 -0.27429626  0.2974988
## 3  0.64464336 -0.40027789 -0.56793020  0.9170400 -0.58932564  1.5656617
## 4 -0.09928495 -0.04348989  0.01173851 -0.1585163  0.07848304 -0.3044375
##        famrel     freetime       goout      health   absences         G1
## 1  0.06937305 -0.037671026 -0.38924340 -0.20230345 -0.2411265  1.2245145
## 2 -0.33214977  0.007491072  0.22458595  0.11007841  0.2933109 -0.6316473
## 3 -0.11991628  0.327936363  0.56380349  0.08036769  0.8028458 -0.8691426
## 4  0.16249732 -0.052009528 -0.01986174  0.03803886 -0.1896759 -0.1679210
##           G2          G3    alc_use
## 1  1.2196444  1.15119286 -0.4930008
## 2 -0.6511951 -0.66705988  0.4591347
## 3 -0.9430175 -0.98425931  1.4475137
## 4 -0.1392539 -0.08379874 -0.2723962
## 
## Coefficients of linear discriminants:
##                    LD1         LD2          LD3
## age        -0.07064798 -0.10742109 -0.154803905
## Medu       -0.21614373 -0.28946292 -0.292725530
## Fedu       -0.04019076  0.11655127  0.358301628
## traveltime  0.30661476 -0.31082720 -0.012004516
## studytime  -0.03182434 -0.07466985 -0.124071455
## failures    0.38617707 -0.87803992 -0.248719853
## famrel     -0.08948820 -0.04204097 -0.813033662
## freetime   -0.03045788 -0.13204798 -0.033243313
## goout       0.08390191  0.25133500  0.003604871
## health     -0.02027092  0.12273177  0.098613125
## absences    0.37848441 -0.27521538  0.221078909
## G1         -0.36587629 -0.65489022  0.378096333
## G2         -0.25588535 -0.69491291  0.318987655
## G3         -0.63484376  0.34177356 -0.616701007
## alc_use     0.85244671 -0.37363870 -0.003084369
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8474 0.1444 0.0083
#plot
plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)

Here we can see that going out, free time and absences tend to be associated with high alcohol use. On the other hand, study time and a good family relationship tend to reduce it.

3D VISUALISATION OF THE CLUSTERS

model_predictors <- dplyr::select(data_LDA2, -alc_use)

# matrix multiplication with the discriminant coefficients of the
# alcohol-use model (lda.fit2, fitted above on these same predictors)
matrix_product <- as.matrix(model_predictors) %*% lda.fit2$scaling
matrix_product <- as.data.frame(matrix_product)

#Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)
#using the plotly package for 3D plotting of the matrix products' columns.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=data_LDA2$alc_use)